#archiveteam 2013-10-11,Fri


Time Nickname Message
00:06 🔗 omf_ I am testing on the cli with rsync
00:06 🔗 pft ok
00:06 🔗 pft i should spend some time with seesaw so i have an idea how it works :|
00:07 🔗 omf_ read the wiki. Some serious warrior and tracker content has been put up recently
00:07 🔗 pft excellent
00:07 🔗 yipdw there's also a bunch of pipelines in ArchiveTeam's git repos
00:07 🔗 omf_ http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior goes into the client side
00:08 🔗 yipdw as well as seesaw-kit heh
00:08 🔗 pft yeah i have a number of them that i was running on my colo box
00:08 🔗 pft since running the vm there seemed a bit tedious
00:08 🔗 omf_ okay it just looks like my rsync target was off in the tracker by a dir
00:08 🔗 omf_ fixing it now
00:08 🔗 pft yay
00:09 🔗 omf_ okay one more time :)
00:09 🔗 yipdw pft: https://github.com/ArchiveTeam/seesaw-kit, just in case you haven't seen it yet
00:10 🔗 pft excellent
00:10 🔗 yipdw there's not many comments, but the code doesn't pull off any weird Python tricks
00:11 🔗 pft what fun would there be if there were comments?
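yipdw's pointer to seesaw-kit is worth unpacking: warrior projects are pipelines of tasks run in sequence over an item. A toy sketch of that idea in Python - illustrative names only, not seesaw-kit's actual API:

```python
# Toy sketch of the pipeline-of-tasks idea behind seesaw-kit.
# These class names are illustrative, NOT seesaw's real API.

class Task:
    """A single step that transforms an item dict."""
    def process(self, item):
        raise NotImplementedError

class GetItemFromTracker(Task):
    def process(self, item):
        # A real task would ask the tracker for work; this is a stub.
        item["url"] = "http://example.invalid/item/1"
        return item

class DownloadItem(Task):
    def process(self, item):
        item["data"] = b"...payload..."  # pretend we fetched item["url"]
        return item

class UploadWithRsync(Task):
    def process(self, item):
        item["uploaded"] = True  # pretend the rsync upload succeeded
        return item

class Pipeline:
    """Runs each task in order on the item."""
    def __init__(self, *tasks):
        self.tasks = tasks

    def run(self, item):
        for task in self.tasks:
            item = task.process(item)
        return item

pipeline = Pipeline(GetItemFromTracker(), DownloadItem(), UploadWithRsync())
result = pipeline.run({})
```

The real kit layers tracker communication, concurrency, and the web UI on top, but the core shape is this sequence of tasks.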
00:12 🔗 pft Starting RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
00:12 🔗 pft Uploading with Rsync to rsync://fos.textfiles.com/blooper/Levon/
00:12 🔗 pft sending incremental file list
00:12 🔗 pft created directory /Levon
00:12 🔗 pft bam
00:12 🔗 omf_ aww shit :D
00:12 🔗 omf_ SketchCow, can you check the files we test uploaded for blooper please?
00:12 🔗 pft it's still going
00:12 🔗 pft bliptv-20131010-170943-00c6-00c6Episode1IslandC001713.mpeg 106430464 10% 2.11MB/s 0:06:56
00:23 🔗 pft there it is
00:23 🔗 pft Tracker confirmed item 'http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg'.
00:23 🔗 pft ta da
00:35 🔗 SketchCow 1.7G BLOOPER
00:35 🔗 SketchCow du -sh BLOOPER
00:36 🔗 SketchCow http://j42.video2.blip.tv/5560010393798/Potatono-OOPViews677.mp4 does bad things
00:38 🔗 omf_ Did someone upload that?
00:39 🔗 SketchCow No. Wait a moment.
00:39 🔗 SketchCow I'm downloading a 700mb video file
00:39 🔗 SketchCow It's downloading fine.
00:39 🔗 SketchCow Damn it is huge.
00:39 🔗 SketchCow Like, hd
01:04 🔗 wp494 heads up: CS:GO just had a replay system added
01:04 🔗 wp494 you might think "what the hell does this have to do with archiving stuff?"
01:04 🔗 wp494 if it's anything similar to dota 2, replays expire after a while
01:10 🔗 omf_ It is not up on the warrior list yet but blip.tv is good to go
01:10 🔗 omf_ So if you want to run it via the command line just pull down https://github.com/ArchiveTeam/blip.tv-grab-video-only
01:10 🔗 omf_ the tracker has 10,000 items and the upload target is setup
01:21 🔗 BiggieJon is blip.tv going down ?
01:22 🔗 omf_ They are going to start erasing old content in 28 days
01:23 🔗 BiggieJon ahh ok
01:23 🔗 BiggieJon ouch, this is going to eat massive bandwidth
01:24 🔗 omf_ yes
01:24 🔗 BiggieJon kids, dont try this at home
01:25 🔗 BiggieJon any unlimited bandwidth vps's we can recommend ?
01:28 🔗 Cameron_D Pretty much any unlimited bandwidth provider will limit your speed if you use a lot
01:29 🔗 Cameron_D *VPS provider
01:30 🔗 Zebranky I still have ~500 GB of Friendster data sitting on my Dreamhost VPS. I really need to migrate that at some point, but they've never so much as hinted that it's a problem.
01:30 🔗 BiggieJon wow
01:30 🔗 BiggieJon I thought they absolutely "forbid" backup storage
01:31 🔗 BiggieJon guess 500gb isnt that much for them
01:31 🔗 Cameron_D Zebranky: really? I found ~300gb of old AT stuff on my DH VPS the other day :D
01:31 🔗 Zebranky It fits into my overall usage pattern anyway. I'm one of their power users. :P
01:31 🔗 BiggieJon :)
01:31 🔗 Zebranky But yeah, they don't seem to care one bit
02:00 🔗 BiggieJon do we have a channel for the blip.tv project ?
02:02 🔗 omf_ #blooper.tv
02:33 🔗 omf_ yipdw, is archivebot still pointing at glados server?
03:34 🔗 cat___ gaijinpot server is on its last legs
03:34 🔗 cat___ also the owner refuses to export the DB because of "privacy concerns"
03:34 🔗 cat___ my reminder that the users table could be blanked except for username/id was ignored
03:34 🔗 cat___ seems like theyre set on killing it.
03:35 🔗 cat___ i currently have 9000 threads archived with an ETA of 7 hours + additional time to redownload failed threads
03:36 🔗 Cameron_D ArchiveBot currently has 26,000 pages downloaded
05:37 🔗 cat___ gaijinpot unreachable again
05:45 🔗 cat___ every 10000 posts or so it craps out, gives a "serious error" and then resumes working. *sigh*
05:46 🔗 cat___ if anyone wants to read about what a giant fail vBulletin 5 is by the way: http://vbtruth.com/
05:46 🔗 cat___ the story goes, Internet Brands (makers of that script) purposely make it suck to kill off communities
06:01 🔗 yipdw omf_: nope, I switched it to fos
06:01 🔗 yipdw cat___: how hard are you hitting it?
06:01 🔗 cat___ 25 simultaneous requests with a custom API scraper
06:02 🔗 yipdw they can't handle that
06:02 🔗 yipdw I'm not surprised it's crapping out :P
06:02 🔗 yipdw ArchiveBot is doing one request at a time with a [0.05, 0.15] second wait between requests, and the duration between requests is way more than that wait
06:03 🔗 cat___ the problem is that im not sure if theyre going to keep it up till the 15th or not. the email exchange with the owner of it didn't seem very promising at all. :(
06:03 🔗 yipdw well
06:03 🔗 yipdw the more often it fails, the less complete a grab either of us are going to get
06:03 🔗 yipdw that's all I'm getting at
06:04 🔗 cat___ thats a fair point.
06:04 🔗 cat___ ill add a delay between API calls.
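The pacing yipdw describes - one request at a time with a random wait drawn from [0.05, 0.15] seconds - can be sketched like this; the `fetch` callable is a stand-in for a real HTTP GET:

```python
import random
import time

def polite_fetch(urls, fetch, min_wait=0.05, max_wait=0.15):
    """Fetch URLs one at a time, sleeping a random interval between
    requests -- the same shape of throttling ArchiveBot uses."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_wait, max_wait))
    return results

# Usage with a stub fetcher (a real scraper would do an HTTP GET here):
fetched = polite_fetch(["a", "b", "c"], fetch=lambda u: u.upper())
```

Dropping from 25 concurrent requests to one throttled worker trades speed for a much better chance the server stays up long enough to finish.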
06:05 🔗 yipdw around ~13% of ArchiveBot's requests have failed, so that grab is only going to get about 80% of the URLs on that site
06:06 🔗 yipdw well, ~85%
06:09 🔗 cat___ ArchiveBot won't revisit failed pages?
06:09 🔗 yipdw no
06:09 🔗 yipdw well
06:10 🔗 yipdw wget does, but it doesn't retry on 5xx
06:10 🔗 yipdw more clearly: there is no special retry behavior coded in; it just uses wget's behavior
06:10 🔗 yipdw the idea is to just get a dump of a target in a standard format
06:11 🔗 yipdw retry could be added, though
06:14 🔗 cat___ might make sense to retry on 5xx errors. the rationale being: the site is shutting down, members are rushing to get their data out (which is overloading the server), and while that happens ArchiveBot is trying to make a dump.
06:14 🔗 cat___ obviously not important for GaijinPot, but I could think of a number of sites where a shutdown would lead to 5xx errors
06:15 🔗 yipdw the main purpose of the bot is to grab smaller sites that aren't crashing
06:15 🔗 yipdw or don't have a large userbase to take them down :P
06:15 🔗 yipdw it just happens though that GaijinPot is (a) going down, (b) interesting and (c) makes an interesting testcase
06:15 🔗 yipdw but yeah, retrying 5xxs does make sense
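A minimal sketch of the retry-on-5xx idea discussed above, assuming a `fetch` callable that returns a (status, body) pair; linear backoff is chosen here purely for illustration:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Retry a request on 5xx responses with linear backoff.
    5xx is treated as transient (overloaded server); anything
    below 500 is returned immediately."""
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status < 500:
            return status, body
        if attempt < retries:
            time.sleep(backoff * (attempt + 1))
    return status, body

# Stub server that fails twice with 503, then recovers:
responses = iter([(503, ""), (503, ""), (200, "ok")])
status, body = fetch_with_retry(lambda u: next(responses),
                                "http://example.invalid/", backoff=0.01)
```

This matches the rationale in the log: on a dying site, 5xx usually means "overloaded right now", not "gone", so a delayed retry often recovers the page.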
08:01 🔗 cat___ GaijinPot owners got back to me again. Asked me if I wanted to *buy* the forum + data from them...
08:02 🔗 cat___ I can understand they'd want money for the vB license, but the content? :/
09:20 🔗 SketchCow ha ha
09:21 🔗 Cameron_D once you finish scraping it, find a way to import it all into a new forum :P
09:33 🔗 cat___ the plan was to create a clone, yes
09:33 🔗 cat___ i already registered gaijinnot.com
09:33 🔗 GLaDOS Perfect..
09:34 🔗 cat___ would've preferred gaijinbot, but someone took it >>;
09:35 🔗 GLaDOS "Live. Wank. Play. In Japan."
09:35 🔗 GLaDOS "About GaijinBot: Not much to say, really. Some cunt made me do it. I just wanted to listen to Pink Floyd."
09:39 🔗 cat___ there's a couple of spoofs like that; mostly banned members PMSing about
09:42 🔗 ersi cat___: Are you keeping a watch so that data returned from the API are really alright data? Especially when it craps out and such
09:44 🔗 cat___ Yes, I've recently added some checks for certain sub-strings and luckily no files failed. Seems like either the response is correct or there is no response at all. For multi-page threads, the scraper can now check whether any are missing.
09:45 🔗 cat___ theres about 20000 threads left to get. Will run another check after that.
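cat___'s completeness check can be sketched roughly like this; the error-marker string and the page layout are assumptions, since the real API responses aren't shown in the log:

```python
def thread_pages_complete(pages):
    """Given {page_number: html} for one multi-page thread, report
    missing page numbers and obviously-bad responses. The "serious
    error" marker is a hypothetical stand-in for whatever sub-strings
    the real checks look for."""
    bad = [n for n, html in pages.items()
           if not html or "serious error" in html.lower()]
    expected = range(1, max(pages) + 1)
    missing = [n for n in expected if n not in pages]
    return missing, bad

# Thread with pages 1, 3, 4 fetched; page 4 came back empty:
missing, bad = thread_pages_complete({1: "<p>post</p>",
                                      3: "<p>post</p>",
                                      4: ""})
```

This matches the observation in the log: either the response is correct or there is no response at all, so a presence-plus-marker check catches most failures.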
09:45 🔗 Nemo_bis So does someone still think forums have some market value? I thought forums were basically dead across the web, more or less like Usenet.
09:45 🔗 ivan` forums are alive and well
09:46 🔗 cat___ id have to agree :x
09:46 🔗 cat___ this forum only died because the forum software upgrade made it suck
09:49 🔗 ivan` http://www.discourse.org/about/
10:02 🔗 ersi Nemo_bis: Yes, plenty still use forums. It's not dead. Netcraft hasn't even said so yet
10:16 🔗 GLaDOS Discourse is amazing.
11:03 🔗 cat___ written in ruby, which my host doesnt support :(
11:03 🔗 cat___ 2:30 hrs left provided the server doesnt crap out again
11:03 🔗 * cat___ crosses fingers
11:47 🔗 cat___ aaaand it crapped out -.-
11:49 🔗 cat___ aaand it's back.
11:49 🔗 cat___ 11180 threads left
13:27 🔗 cat___ 3370 threads left
13:41 🔗 omf_ and just to verify cat___ each thread you grab has all the posts with it?
14:00 🔗 omf_ ugh I need to set the time in pipeline so the countdown on the warrior page is correct
14:06 🔗 SmileyG do we want this as active project omf_ ?
14:08 🔗 omf_ =============================================================================
14:08 🔗 omf_ Attention
14:08 🔗 omf_ We have a new warrior project up and running which is backing up all the videos on blip.tv
14:08 🔗 omf_ Wiki page: http://www.archiveteam.org/index.php?title=Blip.tv
14:08 🔗 omf_ Leaderboard: http://tracker.archiveteam.org/bloopertv/
14:08 🔗 omf_ Enjoy! :D
14:08 🔗 omf_ =============================================================================
14:11 🔗 ersi Heh, I like the item names
14:12 🔗 omf_ I believe this is the simplest grab yet
14:12 🔗 ersi Cloning the repo and starting a few instances now
14:12 🔗 omf_ ivan`, estimated this could easily be 60 terabytes
14:16 🔗 ersi Mmmh, the uploading part is going to take a while at 60KB/s
14:33 🔗 SmileyG where is it uploading to?
14:33 🔗 omf_ FOS
14:37 🔗 SmileyG k
14:37 🔗 SmileyG yeah I think we need a few more upload locations :D
14:40 🔗 cat___ omf_: yes
14:43 🔗 omf_ I just checked out the stats
14:43 🔗 omf_ network is not the problem
14:43 🔗 omf_ we are blowing out the memory on that machine
14:44 🔗 omf_ SketchCow, can you take a look at the Committed Memory chart on the stats page.
14:47 🔗 joepie91 omf_: how much space is the warrior likely to need at any one moment for storing blip stuff?
14:47 🔗 joepie91 I'm not sure if I'll be able to run it
14:47 🔗 joepie91 without running out of disk
14:48 🔗 omf_ well we are talking video download here. So 100mb to 1gb per file, but it only does one file, then uploads
14:48 🔗 omf_ per job
14:48 🔗 omf_ so you can just turn the number of jobs to 1 if you want
14:49 🔗 ersi I think the low upload speed for me is, as usual, some kind of peering problem.
15:09 🔗 omf_ yeah I just uploaded a few items no problem and my connection is dog slow
15:10 🔗 omf_ The memory thing is worrisome
15:56 🔗 cat___ redownloaded threads
15:56 🔗 cat___ redownloaded thread listing*
15:57 🔗 cat___ turns out 500 threads were missing
15:57 🔗 cat___ but it seems complete now
15:57 🔗 cat___ i spot-checked some newer and older threads and the posts were there
15:57 🔗 Sellyme What's the compression like on the videos?
15:57 🔗 Sellyme I imagine it's not very good compared to other primarily text archives
15:58 🔗 Sellyme My download speed is approximately 24x my upload speed, so that's probably going to bottleneck me :s
16:30 🔗 antomatic You're probably right, Sellyme. There's no extra compression on the videos as they're already compressed, so the uploads are the same size as what's downloaded.
16:30 🔗 Sellyme Ech
16:30 🔗 Sellyme That's going to be fun.
16:30 🔗 Sellyme I'm currently averaging a stellar 10.37KB/s upload
16:31 🔗 antomatic oo...
16:31 🔗 Sellyme That being said, I am uploading 150,000 crawled websites to Majestic-12, and running 8 simultaneous FTP transfers
16:31 🔗 Sellyme Hopefully when they finish I can get ~80KB/s
16:32 🔗 antomatic These videos are certainly big - average 264mb each at the moment, some 1gb or more.
16:34 🔗 Sellyme I should be able to upload around 6GB/day
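Sellyme's 6GB/day figure checks out as back-of-envelope arithmetic: a sustained rate in KB/s times 86,400 seconds in a day gives the daily volume.

```python
def daily_upload_gb(kb_per_s):
    """Convert a sustained upload rate in KB/s to GB per day
    (binary units: 1 GB = 1024 * 1024 KB)."""
    seconds_per_day = 24 * 60 * 60
    return kb_per_s * seconds_per_day / (1024 * 1024)

estimate = daily_upload_gb(80)    # the hoped-for rate, ~6.6 GB/day
current = daily_upload_gb(10.37)  # the current rate, under 1 GB/day
```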
16:50 🔗 SketchCow Hmmm?
16:55 🔗 ersi SketchCow: Blip.tv downloader.
16:55 🔗 SketchCow Yeah
16:55 🔗 SketchCow FOS is doing quite a bit. Sorry to hear it's a problematic companion.
16:56 🔗 ersi It's not "used" until it's run into the ground. So.. I guess we're doing okay.
16:58 🔗 Sellyme There we go, first upload complete~
17:00 🔗 yipdw works fine here, I'm getting 2 MB/s upload to fos
17:02 🔗 yipdw er, heh
17:02 🔗 yipdw 10 MB/s
17:08 🔗 SmileyG you are? heh, ok thats alright then
17:08 🔗 SmileyG as long as it's fast for the people with nice connections
17:11 🔗 Sellyme Yeah, not sure it's going to matter immensely if it's not too fast for me :P
17:14 🔗 Lord_Nigh http://socialfixer.com/blog/2013/10/05/facebook-requires-social-fixer-browser-extension-to-remove-key-features/ <- should the non-nerfed social fixer versions be archived?
17:19 🔗 ersi I'm getting 60KB/s and I can easily do 1.2MB/s otherwise.
17:39 🔗 SmileyG Lord_Nigh: archive ALL THE THINGS
17:44 🔗 Sellyme Lord_Nigh: "What about AdBlock [...] I was told by the person at Facebook that she was not aware of [this app]"
17:44 🔗 Sellyme ahahahahahaha bullshit
17:48 🔗 Cowering could someone check if gmail POP3 is down? dead for an hour here
17:49 🔗 Sellyme First off, it's Google, so it's probably just you.
17:49 🔗 Sellyme And it seems to be working fine for me, but I'm not really sure the best way to go about testing that
17:50 🔗 Cowering thunderbird is just throwing a zillion errors checking my gmail (when it even gets DNS resolved)
17:50 🔗 Cowering other accounts are fine
18:44 🔗 Vito` WHAT
18:44 🔗 Vito` whoops
18:44 🔗 Vito` WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
18:45 🔗 Vito` soooo not produced by a bot I guess
18:46 🔗 ersi THY SECRET WORD is "yahoosucks" GO FORTH AND IMPART THY KNOWLEDGE
18:46 🔗 Vito` thanks
18:46 🔗 ersi Yeah, we've had problems with spambots. Mediawiki is as hardened against that as a sieve is to holding water
21:39 🔗 joepie91 ersi: or to holding data, for that matter
21:39 🔗 * joepie91 thinks Yahoo starts to look more and more like a sieve every week or so
22:28 🔗 arkhive Is there a save bebo channel?
23:30 🔗 cat___ so 7zip compressed all of gaijinpot down to ~218MB
23:30 🔗 cat___ originally it was around 11GB
23:30 🔗 cat___ now i get to write something that converts the "json-but-not-quite" API results to HTML pages
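A rough sketch of that conversion step. The actual quirks of the "json-but-not-quite" responses aren't specified in the log, so stripping trailing commas is just one plausible example of a fixup, and the HTML layout here is invented:

```python
import json
import re
from html import escape

def parse_almost_json(text):
    """Lenient parse for an API that emits near-JSON. As an example
    fixup, strip trailing commas before a closing brace/bracket, a
    common deviation from strict JSON."""
    cleaned = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(cleaned)

def posts_to_html(posts):
    """Render a list of {author, body} dicts as minimal HTML,
    escaping user content on the way out."""
    rows = (f"<div class='post'><b>{escape(p['author'])}</b>"
            f"<p>{escape(p['body'])}</p></div>" for p in posts)
    return "<html><body>%s</body></html>" % "".join(rows)

# Near-JSON input with trailing commas, as the API might emit it:
data = parse_almost_json('[{"author": "cat", "body": "hello & bye",},]')
page = posts_to_html(data)
```

Escaping at render time matters here: forum post bodies are untrusted input even when they come from your own archive.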
23:48 🔗 godane looks like i found thelabwithleo's flickr account: http://www.flickr.com/photos/labwithleo/
23:53 🔗 godane here is roger chang's flickr account: http://www.flickr.com/photos/rogerchang/
23:58 🔗 godane this guy has some techtv stuff too: http://www.flickr.com/photos/mr_o/
