[00:06] I am testing on the CLI with rsync
[00:06] ok
[00:06] I should spend some time with seesaw so I have an idea how it works :|
[00:07] read the wiki. Some serious warrior and tracker content has been put up recently
[00:07] excellent
[00:07] there's also a bunch of pipelines in ArchiveTeam's git repos
[00:07] http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior goes into the client side
[00:08] as well as seesaw-kit heh
[00:08] yeah I have a number of them that I was running on my colo box
[00:08] since running the VM there seemed a bit tedious
[00:08] okay it just looks like my rsync target was off in the tracker by a dir
[00:08] fixing it now
[00:08] yay
[00:09] okay one more time :)
[00:09] pft: https://github.com/ArchiveTeam/seesaw-kit, just in case you haven't seen it yet
[00:10] excellent
[00:10] there aren't many comments, but the code doesn't pull off any weird Python tricks
[00:11] what fun would there be if there were comments?
[00:12] Starting RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
[00:12] Uploading with Rsync to rsync://fos.textfiles.com/blooper/Levon/
[00:12] sending incremental file list
[00:12] created directory /Levon
[00:12] bam
[00:12] aww shit :D
[00:12] SketchCow, can you check the files we test-uploaded for blooper please?
[00:12] it's still going
[00:12] bliptv-20131010-170943-00c6-00c6Episode1IslandC001713.mpeg 106430464 10% 2.11MB/s 0:06:56
[00:23] there it is
[00:23] Tracker confirmed item 'http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg'.
[00:23] ta-da
[00:35] 1.7G BLOOPER
[00:35] du -sh BLOOPER
[00:36] http://j42.video2.blip.tv/5560010393798/Potatono-OOPViews677.mp4 does bad things
[00:38] Did someone upload that?
[00:39] No. Wait a moment.
[00:39] I'm downloading a 700MB video file
[00:39] It's downloading fine.
[00:39] Damn, it is huge.
[00:39] Like, HD
[01:04] heads up: CS:GO just had a replay system added
[01:04] you might think "what the hell does this have to do with archiving stuff?"
[01:04] if it's anything similar to Dota 2, replays expire after a while
[01:10] It is not up on the warrior list yet but blip.tv is good to go
[01:10] So if you want to run it via the command line just pull down https://github.com/ArchiveTeam/blip.tv-grab-video-only
[01:10] the tracker has 10,000 items and the upload target is set up
[01:21] is blip.tv going down?
[01:22] They are going to start erasing old content in 28 days
[01:23] ahh ok
[01:23] ouch, this is going to eat massive bandwidth
[01:24] yes
[01:24] kids, don't try this at home
[01:25] any unlimited-bandwidth VPSes we can recommend?
[01:28] Pretty much any unlimited bandwidth provider will limit your speed if you use a lot
[01:29] *VPS provider
[01:30] I still have ~500 GB of Friendster data sitting on my Dreamhost VPS. I really need to migrate that at some point, but they've never so much as hinted that it's a problem.
[01:30] wow
[01:30] I thought they absolutely "forbid" backup storage
[01:31] guess 500GB isn't that much for them
[01:31] Zebranky: really? I found ~300GB of old AT stuff on my DH VPS the other day :D
[01:31] It fits into my overall usage pattern anyway. I'm one of their power users. :P
[01:31] :)
[01:31] But yeah, they don't seem to care one bit
[02:00] do we have a channel for the blip.tv project?
[02:02] #blooper.tv
[02:33] yipdw, is archivebot still pointing at the glados server?
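[A seesaw-kit pipeline like the blip.tv one discussed above is essentially a sequence of tasks applied to each item handed out by the tracker: fetch an item, download it, rsync it up, report done. The sketch below is reconstructed from memory and heavily simplified; the tracker URL, nickname, directory layout, and stats handling are placeholders, and exact seesaw-kit class signatures may differ from what's shown.]

import os
import re

from seesaw.externalprocess import RsyncUpload, WgetDownload
from seesaw.item import ItemInterpolation
from seesaw.pipeline import Pipeline
from seesaw.project import Project
from seesaw.task import SimpleTask
from seesaw.tracker import GetItemFromTracker, SendDoneToTracker

class PrepareDirectories(SimpleTask):
    # Give each item its own scratch directory, derived from the item name
    def __init__(self):
        SimpleTask.__init__(self, "PrepareDirectories")

    def process(self, item):
        item["item_dir"] = "data/" + re.sub(r"[^A-Za-z0-9]", "_", item["item_name"])
        os.makedirs(item["item_dir"])

project = Project(title="blooper-example")

pipeline = Pipeline(
    # Ask the tracker for the next work item ("SomeNick" shows on the leaderboard)
    GetItemFromTracker("http://tracker.example.org/blooper", downloader="SomeNick"),
    PrepareDirectories(),
    # Download the item; in this project the item name is the video URL itself
    WgetDownload(["wget", "-nv",
                  "-O", ItemInterpolation("%(item_dir)s/video.mpeg"),
                  ItemInterpolation("%(item_name)s")]),
    # Upload the result over rsync, like the RsyncUpload step quoted in the log
    RsyncUpload(target="rsync://fos.textfiles.com/blooper/SomeNick/",
                files=[ItemInterpolation("%(item_dir)s/video.mpeg")]),
    # Report the finished item back to the tracker
    SendDoneToTracker(tracker_url="http://tracker.example.org/blooper", stats={}),
)

[Real pipelines also send stats to the tracker and clean up the scratch directory; see the actual project repos linked above for the authoritative versions.]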
[03:34] gaijinpot server is on its last legs
[03:34] also the owner refuses to export the DB because of "privacy concerns"
[03:34] my reminder that the users table could be blanked except for username/id was ignored
[03:34] seems like they're set on killing it.
[03:35] I currently have 9000 threads archived with an ETA of 7 hours + additional time to redownload failed threads
[03:36] ArchiveBot currently has 26,000 pages downloaded
[05:37] gaijinpot unreachable again
[05:45] every 10000 posts or so it craps out, gives a "serious error" and then resumes working. *sigh*
[05:46] if anyone wants to read about what a giant fail vBulletin 5 is, by the way: http://vbtruth.com/
[05:46] the story goes, Internet Brands (makers of that script) purposely make it suck to kill off communities
[06:01] omf_: nope, I switched it to fos
[06:01] cat___: how hard are you hitting it?
[06:01] 25 simultaneous requests with a custom API scraper
[06:02] they can't handle that
[06:02] I'm not surprised it's crapping out :P
[06:02] ArchiveBot is doing one request at a time with a [0.05, 0.15] second wait between requests, and the duration between requests is way more than that wait
[06:03] the problem is that I'm not sure if they're going to keep it up till the 15th or not. the email exchange with the owner of it didn't seem very promising at all. :(
[06:03] well
[06:03] the more often it fails, the less complete a grab either of us is going to get
[06:03] that's all I'm getting at
[06:04] that's a fair point.
[06:04] I'll add a delay between API calls.
[06:05] around ~13% of ArchiveBot's requests have failed, so that grab is only going to get about 80% of the URLs on that site
[06:06] well, ~85%
[06:09] ArchiveBot won't revisit failed pages?
[06:09] no
[06:09] well
[06:10] wget does, but it doesn't retry on 5xx
[06:10] more clearly: there is no special retry behavior coded in; it just uses wget's behavior
[06:10] the idea is to just get a dump of a target in a standard format
[06:11] retry could be added, though
[06:14] might make sense to retry on 5xx errors. the rationale being: the site is shutting down, members are rushing to get their data out (which is overloading the server), and while that happens ArchiveBot is trying to make a dump.
[06:14] obviously not important for GaijinPot, but I can think of a number of sites where a shutdown would lead to 5xx errors
[06:15] the main purpose of the bot is to grab smaller sites that aren't crashing
[06:15] or don't have a large userbase to take them down :P
[06:15] it just happens though that GaijinPot is (a) going down, (b) interesting and (c) makes an interesting test case
[06:15] but yeah, retrying 5xxs does make sense
[08:01] GaijinPot owners got back to me again. Asked me if I wanted to *buy* the forum + data from them...
[08:02] I can understand they'd want money for the vB license, but the content? :/
[09:20] ha ha
[09:21] once you finish scraping it, find a way to import it all into a new forum :P
[09:33] the plan was to create a clone, yes
[09:33] I already registered gaijinnot.com
[09:33] Perfect..
[09:34] would've preferred gaijinbot, but someone took it >>;
[09:35] "Live. Wank. Play. In Japan."
[09:35] "About GaijinBot: Not much to say, really. Some cunt made me do it. I just wanted to listen to Pink Floyd."
[09:39] there are a couple of spoofs like that; mostly banned members PMSing about
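[The politeness-and-retry discussion above (the [0.05, 0.15] second wait between requests, adding a delay between API calls, retrying on 5xx) boils down to something like the sketch below. This is not ArchiveBot's actual code; it's an illustration using the `requests` library, and the backoff numbers are arbitrary.]

import random
import time

import requests  # any HTTP client works; requests is used here for brevity

def polite_get(url, tries=5, delay_range=(0.05, 0.15)):
    """GET a URL with a randomized inter-request delay, retrying on 5xx."""
    resp = None
    for attempt in range(tries):
        # Small randomized wait so requests don't hammer a struggling server
        time.sleep(random.uniform(*delay_range))
        resp = requests.get(url, timeout=60)
        if resp.status_code < 500:
            return resp  # success, or a 4xx that retrying won't fix
        # 5xx: the server is overloaded; back off exponentially before retrying
        time.sleep(2 ** attempt)
    resp.raise_for_status()  # still failing after all tries

[The key design point from the conversation: retry only on 5xx, since those are exactly the errors an overloaded, dying site throws, while 4xx responses are genuine and not worth revisiting.]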
[09:42] cat___: Are you keeping watch so that the data returned from the API is actually good data? Especially when it craps out and such
[09:44] Yes, I've recently added some checks for certain sub-strings and luckily no files failed. Seems like either the response is correct or there is no response at all. For multi-page threads, the scraper can now check whether any are missing.
[09:45] there are about 20000 threads left to get. Will run another check after that.
[09:45] So does someone still think forums have some market value? I thought forums were basically dead across the web, more or less like Usenet.
[09:45] forums are alive and well
[09:46] I'd have to agree :x
[09:46] this forum only died because the forum software upgrade made it suck
[09:49] http://www.discourse.org/about/
[10:02] Nemo_bis: Yes, plenty of people still use forums. It's not dead. Netcraft hasn't even said so yet
[10:16] Discourse is amazing.
[11:03] written in Ruby, which my host doesn't support :(
[11:03] 2:30 hrs left provided the server doesn't crap out again
[11:03] * cat___ crosses fingers
[11:47] aaaand it crapped out -.-
[11:49] aaand it's back.
[11:49] 11180 threads left
[13:27] 3370 threads left
[13:41] and just to verify, cat___: each thread you grab has all the posts with it?
[14:00] ugh I need to set the time in the pipeline so the countdown on the warrior page is correct
[14:06] do we want this as an active project, omf_?
[14:08] =============================================================================
[14:08] Attention
[14:08] We have a new warrior project up and running which is backing up all the videos on blip.tv
[14:08] Wiki page: http://www.archiveteam.org/index.php?title=Blip.tv
[14:08] Leaderboard: http://tracker.archiveteam.org/bloopertv/
[14:08] Enjoy! :D
[14:08] =============================================================================
[14:11] Heh, I like the item names
[14:12] I believe this is the simplest grab yet
[14:12] Cloning the repo and starting a few instances now
[14:12] ivan` estimated this could easily be 60 terabytes
[14:16] Mmmh, the uploading part is going to take a while at 60KB/s
[14:33] where is it uploading to?
[14:33] FOS
[14:37] k
[14:37] yeah I think we need a few more upload locations :D
[14:40] omf_: yes
[14:43] I just checked out the stats
[14:43] network is not the problem
[14:43] we are blowing out the memory on that machine
[14:44] SketchCow, can you take a look at the Committed Memory chart on the stats page?
[14:47] omf_: how much space is the warrior likely to need at any one moment for storing blip stuff?
[14:47] I'm not sure if I'll be able to run it
[14:47] without running out of disk
[14:48] well we are talking video download here. So 100MB to 1GB per file, but it only does one file, then uploads
[14:48] per job
[14:48] so you can just turn the number of jobs down to 1 if you want
[14:49] I think the low upload speed for me is, as usual, some kind of peering problem.
[15:09] yeah I just uploaded a few items no problem and my connection is dog slow
[15:10] The memory thing is worrisome
[15:56] redownloaded threads
[15:56] redownloaded thread listing*
[15:57] turns out 500 threads were missing
[15:57] but it seems complete now
[15:57] I spot-checked some newer and older threads and the posts were there
[15:57] What's the compression like on the videos?
[15:57] I imagine it's not very good compared to the other, primarily text, archives
[15:58] My download speed is approximately 24x my upload speed, so that's probably going to bottleneck me :s
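[For a sense of scale on the numbers quoted in this stretch of the log (264MB average item, a 60KB/s uplink, a 6GB/day budget, a possible 60TB total), a quick back-of-the-envelope calculation; all figures are the ones mentioned above, not measurements:]

# Rough arithmetic on the blip.tv figures discussed above
avg_item_mb = 264       # average video size quoted above
upload_kb_s = 60        # the slow upload speed several people report
daily_budget_gb = 6     # "around 6GB/day"
total_tb = 60           # ivan`'s estimate for the whole grab

secs_per_item = avg_item_mb * 1024 / upload_kb_s
print("one item at 60KB/s: %.1f hours" % (secs_per_item / 3600))       # ~1.3 hours
print("items/day at 6GB/day: %.0f" % (daily_budget_gb * 1024 / avg_item_mb))  # ~23
print("days for 60TB at 6GB/day: %.0f" % (total_tb * 1024 / daily_budget_gb)) # ~10240

[Which is exactly why one slow uploader doesn't matter much: the grab only works because many warriors, some with fast links, run in parallel.]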
[16:30] You're probably right, Sellyme. There's no extra compression on the videos as they're already compressed, so the uploads are the same size as what's downloaded.
[16:30] Ech
[16:30] That's going to be fun.
[16:30] I'm currently averaging a stellar 10.37KB/s upload
[16:31] oo...
[16:31] That being said, I am uploading 150,000 crawled websites to Majestic-12, and running 8 simultaneous FTP transfers
[16:31] Hopefully when they finish I can get ~80KB/s
[16:32] These videos are certainly big - average 264MB each at the moment, some over 1GB.
[16:34] I should be able to upload around 6GB/day
[16:50] Hmmm?
[16:55] SketchCow: Blip.tv downloader.
[16:55] Yeah
[16:55] FOS is doing quite a bit. Sorry to hear it's a problematic companion.
[16:56] It's not "used" until it's run into the ground. So.. I guess we're doing okay.
[16:58] There we go, first upload complete~
[17:00] works fine here, I'm getting 2 MB/s upload to fos
[17:02] er, heh
[17:02] 10 MB/s
[17:08] you are? heh, ok that's alright then
[17:08] as long as it's fast for the people with nice connections
[17:11] Yeah, not sure it's going to matter immensely if it's not too fast for me :P
[17:14] http://socialfixer.com/blog/2013/10/05/facebook-requires-social-fixer-browser-extension-to-remove-key-features/ <- should the non-nerfed Social Fixer versions be archived?
[17:19] I'm getting 60KB/s and I can easily do 1.2MB/s otherwise.
[17:39] Lord_Nigh: archive ALL THE THINGS
[17:44] Lord_Nigh: "What about AdBlock [...] I was told by the person at Facebook that she was not aware of [this app]"
[17:44] ahahahahahaha bullshit
[17:48] could someone check if Gmail POP3 is down? dead for an hour here
[17:49] First off, it's Google, so it's probably just you.
[17:49] And it seems to be working fine for me, but I'm not really sure of the best way to go about testing that
[17:50] Thunderbird is just throwing a zillion errors checking my Gmail (when it even gets DNS resolved)
[17:50] other accounts are fine
[18:44] WHAT
[18:44] whoops
[18:44] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[18:45] soooo not produced by a bot I guess
[18:46] THY SECRET WORD is "yahoosucks" GO FORTH AND IMPART THY KNOWLEDGE
[18:46] thanks
[18:46] Yeah, we've had problems with spambots. MediaWiki is as hardened against that as a sieve is to holding water
[21:39] ersi: or to holding data, for that matter
[21:39] * joepie91 thinks Yahoo starts to look more and more like a sieve every week or so
[22:28] Is there a save-Bebo channel?
[23:30] so 7zip compressed all of gaijinpot down to ~218MB
[23:30] originally it was around 11GB
[23:30] now I get to write something that converts the "json-but-not-quite" API results to HTML pages
[23:48] looks like I found thelabwithleo's Flickr account: http://www.flickr.com/photos/labwithleo/
[23:53] here is Roger Chang's Flickr account: http://www.flickr.com/photos/rogerchang/
[23:58] this guy has some TechTV stuff too: http://www.flickr.com/photos/mr_o/
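[As for the gaijinpot converter mentioned at 23:30 (turning the archived "json-but-not-quite" API responses into static HTML pages), the core of such a tool might look like the sketch below. The field names (title, posts, author, body) and the threads/ directory layout are made-up placeholders, since the real API schema isn't shown here; the responses' not-quite-JSON quirks would also need to be repaired before json.loads will accept them.]

import html
import json
import pathlib

def thread_to_html(thread):
    """Render one thread dict as a minimal HTML page (assumed field names)."""
    parts = ["<h1>%s</h1>" % html.escape(thread["title"])]
    for post in thread["posts"]:
        parts.append(
            "<div class='post'><b>%s</b><p>%s</p></div>"
            % (html.escape(post["author"]), html.escape(post["body"]))
        )
    return "\n".join(parts)

# Convert every dumped thread file into an HTML page alongside it
for path in pathlib.Path("threads").glob("*.json"):
    thread = json.loads(path.read_text(encoding="utf-8"))
    out = path.with_suffix(".html")
    out.write_text(thread_to_html(thread), encoding="utf-8")

[Escaping every user-supplied string with html.escape matters here: forum post bodies are untrusted input, and a clone built from raw dumps would otherwise replay any markup or script the original members posted.]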