[00:20] only 1,000,000 splinder users to go
[00:24] Cool. Just over 1/4 done.
[00:24] How big is splinder
[00:25] So far, 217GB of warc, but I think alard says some were being underreported
[00:26] I'm running a du now on what I have, and seeing how it compares to the 3GB it shows on the dashboard.
[00:30] Actually, it seems pretty close to right: I have 6GB of stuff, and that's WARC + normal files.
[00:31] the size on the dashboard only counts the warc files (except that one of them wasn't being counted for a while, as you mentioned)
[00:31] So, if that scales, there's been ~450GB downloaded so far.
[00:32] And that's roughly 1/4 of the accounts we have in the list. Call it 2TB just to add some fudge factor.
[00:32] OK.
[00:32] Thanks.
[00:33] No problem. Seems like it's going to be another one of those "didn't have $100 for an external drive?" kind of deals.
[01:32] underscor: ping
[01:32] actually, megh
[01:32] underscor: need help with anyhub?
[01:32] pong
[01:32] If you have spare bw/disk, yeah
[01:32] I'm spinning up another EC2 instance, need to purpose it
[01:32] ok
[01:32] oh how I miss paying $100 on my AWS bills
[01:33] hah
[01:33] speaking of AWS, an Archive Team AMI would be pretty funny
[01:34] and possible to some extent now that alard's done most of the work needed to get a standard spidering structure going
[01:34] very nearly
[01:34] currently all of the common code is shared between three or more projects
[01:35] yeah, and we have a common output format now
[01:35] hmmm.
[01:35] warc ftw
[01:35] although it's not perfect yet
[01:36] splinder has three warc files per user
[02:07] so what is splinder?
[02:07] lol
[02:07] jason is talking about it now on google hangout
[02:08] italian geocities
[02:59] looks like splinder has hit 10k users/hour
[03:00] donbex is really knocking it out of the park
[03:13] We're grabbing splinder at 10k/hour?
[03:16] we peaked there just a little while ago
[03:16] half an hour or 45 minutes ago
[05:47] heh, in four hours I've grabbed 11 gigs of anyhub
[05:57] hallo hallo
[05:57] how are you all tonight?
[05:59] uploading a bunch of OWS videos that got pulled off the internet -- 58 vids and ~1.6GB total
[06:00] :D
[06:00] Yeah, get on that
[06:01] am doing it now
[06:01] actually, is it cool to upload these in one big batch?
[06:01] what's the etiquette on that?
[06:02] Just do it
[06:02] I work there. Just do it.
[06:02] I prefer they be individual items
[06:02] but feel free to upload each one after another
[06:04] cool
[06:04] thanks for your help =)
[06:09] excellent, good work Fontaine!
[06:09] thanks
[07:07] Hey,
[07:07] I enjoyed your speech on Archive Team at Defcon (via Youtube). I also worry about all the history we're throwing away. I've helped out on videogame archiving, particularly tape programs for the obscure Astrocade system.
[07:07] Watching this also made me remember a site I saw about to shut down--
[07:07] http://www.mypodcast.com/
[07:07] "We must regretfully inform you that effective December 1, 2011, MyPodcast.com will cease all operations." etc., etc.
[07:07] I didn't see it mentioned on the Archive Team site, so I figured I'd better send in a tip. There are quite a few podcasts on there that might not be available anywhere else. Fortunately they do at least have a directory.
[07:07] http://www.mypodcast.com/podcast_directory.html
[07:07] Keep up the good work!
[07:08] OK, taking a shot at this.
[07:08] awesome
[07:10] here we go
[07:13] Oh well played, all the podcasts are ALREADY deleted.
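A rough sketch of the back-of-envelope sizing from the 00:25-00:32 exchange above. The 217 GB figure and the "roughly 1/4 done" fraction come straight from the conversation; the 2x multiplier (total data vs. WARC-only data on the dashboard) is an assumption based on the 6 GB total vs. 3 GB WARC ratio observed locally:

    # Hypothetical back-of-envelope estimate; all inputs are taken from the chat above.
    warc_gb=217          # WARC bytes counted by the dashboard so far
    total_per_warc=2     # assumed: total data is about 2x the WARC-only figure
    inverse_done=4       # roughly 1/4 of the user list has been grabbed
    echo "downloaded so far: ~$(( warc_gb * total_per_warc )) GB"                 # ~434 GB ("call it 450")
    echo "projected total:   ~$(( warc_gb * total_per_warc * inverse_done )) GB"  # ~1.7 TB, ~2 TB with fudge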
[07:13] some seem to work, e.g. http://www.mypodcast.com/fsaudio/djbillyg_20110820_0101-751922.mp3
[07:14] Weird
[07:14] I am having trouble getting stuff.
[07:14] http://gundamcast.mypodcast.com/
[07:14] the pages are all deleted
[07:14] everything but the choice of theme, apparently
[07:14] http://radiohotncold.mypodcast.com/
[07:15] I've checked a bunch, and http://djbillyg.mypodcast.com/ is the only one I can find that actually has anything on it
[07:15] http://afewlastwordswithgeorgecarlinandtonyhendra.mypodcast.com/
[07:15] hrm
[07:15] A LOT are simply unused.
[07:15] Well, someone take a shot at it.
[07:15] perhaps it just wasn't very popular
[07:15] I seem to have issues
[07:16] wget --user-agent=fucksticks -e robots=off -mc -H -np http://mypodcast.com/
[07:17] Trying a different way, one moment.
[07:17] Hate -m, it never usually works
[07:19] --2011-11-16 23:19:49-- http://twihardsrus.mypodcast.com/
[07:19] 2011-11-16 23:19:49 ERROR 503: Service Unavailable.
[07:19] HTTP request sent, awaiting response... 503 Service Unavailable
[07:19] Resolving twihardsrus.mypodcast.com... 66.154.43.240
[07:19] Reusing existing connection to www.mypodcast.com:80.
[07:19] SO many are like that
[07:21] huh
[07:21] yeah, -m isn't so hot
[07:21] The solution is obvious.
[07:21] http://www.podbean.com/
[07:21] Let's figure out how to download all of those
[07:21] I tried that on twihardsrus and ended up redirected to www.dating.co.za
[07:22] 419,162 PODCASTERS, 1,974,644 EPISODES, 184,725,988 DOWNLOADS AS OF TODAY.
[07:30] so, 7,760 accounts on mypodcast
[07:30] hmm
[07:49] hey SketchCow
[07:54] Heyyyyy
[07:57] hi
[07:58] I emailed you a week or two ago about wanting to dump some old newspaper scans somewhere, did that ever get to you?
[07:59] they should go on archive.org, like everything else :)
[08:00] yeah I just wasn't sure of the process
[08:01] the images need to be stitched is the thing, so I was hoping to hand it off to someone who has the process down pat
[10:48] ah, huh.
[10:48] you have scans of newspaper, or of microfilm reels?
[11:08] 02:02:40 < ATidlebot> underscor hit it off with a drunk sorority chick named Jenny! This wondrous godsend has accelerated them 0 days, 00:59:52 towards level 37.
[11:08] good job underscor
[11:15] what should I do in a case like this? http://toolserver.org/~nemobis/wget-phase-3-alsoit.splinder.com.log
[11:15] - Downloading blog from alsoit.splinder.com... ERROR (3).
[11:15] Error downloading 'it:alsoit'.
[11:17] I think there are some 404 errors because I lost the connection for some minutes, or something like that
[11:20] Nemo_bis: ./dld-single.sh Nemo it:alsoit to download it again. (Error 3 is somewhat more serious than 404 errors, it was probably a connection problem.)
[11:21] alard, don't I lose what I downloaded?
[11:21] Yes, you will. Is it large?
[11:21] 342 MiB
[11:21] Cannot write to "data/it/a/al/als/alsoit/files/www.splinder.com/myblog/comment/list/4012485/9175708" (Success).
[11:22] hmm
[11:22] Is that the last url in the url list?
[11:22] (And is your disk ok?)
[11:23] perhaps the problem is that rsync used all io resources for some minutes and I ionice'd the download processes
[11:24] The question is: is this download complete enough?
[11:24] where do I see the list of URLs?
[11:24] In the data/it/a/al/als/alsoit/ directory. But in this case, it looks like you were downloading a blog, which doesn't use a URL list.
[11:25] Did it already download the tiosla blog?
[11:25] (alsoit has two blogs.)
[11:25] Should be in ls -la data/it/a/al/als/alsoit/
[11:26] ls data/it/a/al/als/alsoit/files/
[11:26] alsoit.splinder.com files.splinder.com syndication.splinder.com www.splinder.com
[11:26] nothing about tiosla
[11:26] And one directory higher? ls -la data/it/a/al/als/alsoit/
[11:27] blogs.txt splinder.com-alsoit-blog-alsoit.splinder.com.warc.gz wget-phase-1.log
[11:27] files splinder.com-alsoit-html.warc.gz wget-phase-2.log
[11:27] media-urls.txt splinder.com-alsoit-media.warc.gz wget-phase-3-alsoit.splinder.com.log
[11:27] (The files/ are only from the blog it was downloading before it stopped.)
[11:27] Hmm, no sign of the tiosla blog.
[11:29] In that case, without modifying the script, there's no other solution than to redownload the whole thing. The script can't pick up from a half downloaded profile.
[11:29] If you want, you could make a copy of the script and change it to skip the html and media downloads, and possibly the alsoit blog.
[11:31] aww
[11:31] this is even worse: http://toolserver.org/~nemobis/wget-phase-3-night.splinder.com.log
[11:31] 799 MiB
[11:35] pfff, mobileme has much bigger users :)
[11:36] * chronomex fires up his first splinder downloader
[11:37] ;)
[11:38] chronomex, in fact it's not a tragedy, I have users worth 3 GiB, could have been worse
[11:39] I had a mobileme user that had so many files it caused wget-warc to allocate all my memory and crash
[11:39] after 18 hours
[11:39] hehe
[11:39] and 16 gigabytes
[11:39] when does it use so much memory?
[11:39] mine seems to always write to disk
[11:39] when there are 16 gigs worth of 100kbyte files
[11:40] it keeps the urls around when in --mirror
[11:40] 16 GiB of URLs??
[11:40] 16G of warc, my memory got full of urls I think
[11:42] ah
[11:45] 03:45:23 < ATidlebot> chronomex hit it off with a drunk sorority chick named Jenny! This wondrous godsend has accelerated them 0 days, 00:32:18 towards level 36.
[11:45] jenny's really getting around
[12:06] Hmm... at some point, every time I tried to download media (and the occasional blog) it says "done, with HTTP errors".
[12:07] Is this something that's always happened, and now the new version is just telling me about it, or has something gone wrong?
[12:08] I always had errors
[12:09] I can't find a user without HTTP errors
[12:10] OK. As long as I'm not hosing a bunch of profiles.
[12:12] DoubleJ: That always happened, the new version is telling you about it. Some 404 errors are to be expected, since the script is generating these media urls (to try to get the largest version of photos, videos etc.).
[15:28] Blorp
[15:31] I'm downloading what IS there at mypodcast pretty well right now.
[15:35] what is the podcast called? :P
[17:53] chronomex: I have top-down photos of the actual newspaper, not microfilm scans
[17:55] 75G www.mypodcast.com
[17:55] And growing.
[17:55] So far, 2,349 individual episodes, which is at least saving something.
[18:19] SketchCow: this group archiving that?
[18:19] is archiving*
[18:19] I am, yes.
[18:20] Sometimes, it's just me.
[18:20] Gotta keep my hand in it.
[18:29] SketchCow: well I hope you're able to get it all.
[18:45] root@teamarchive-0:/3/MYPODCAST/www.mypodcast.com/fsaudio# ls | wc -l
[18:45] 2545
[18:45] 2652
[18:45] root@teamarchive-0:/3/MYPODCAST/www.mypodcast.com/fsaudio# ls | wc -l
[18:45] It's coming along.
[18:59] splinder question: if i started downloading a user but cannot finish it, can i simply kill the job and move on or do i need to tell someone/the server?
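A condensed sketch of the recovery path alard lays out in the 11:20-11:29 exchange above, for a splinder user whose blog download died with error 3. Both commands are taken from the log ("Nemo" is the nickname and it:alsoit the user in that exchange); note the caveat that the script can't resume a half-downloaded profile, so re-running it loses the partial data and fetches the whole profile again:

    # See what was saved for the user so far: per-blog warc.gz files plus the
    # wget-phase-*.log files for each download phase.
    ls -la data/it/a/al/als/alsoit/
    # Re-do just this one user; the previously downloaded data is lost and the
    # profile is downloaded again from scratch.
    ./dld-single.sh Nemo it:alsoit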
[18:59] it:kiarablog is it
[19:05] Schbirid: simply kill the job and restart it (see *Notes* on the wiki page)
[19:06] hi donbex
[19:07] hello Nemo_bis
[19:07] donbex, how far along are you, did you manage to kill any of them? :)
[19:07] donbex: i tried downloading that user two times, don't want to try again
[19:08] Schbirid, but you restart dld-client
[19:09] Schbirid: users are assigned automatically (I think randomly, but I'm not sure about this)
[19:09] ah, SketchCow, thank you for fixing spaces in the topic
[19:09] you shouldn't get it again, at least for a while
[19:10] Nemo_bis: what do you mean?
[19:10] btw, shouldn't we speak english here?
[19:11] as you want donbex
[19:11] I like multilingual channels
[19:11] I mean if you reduced processes; but I see that you're still downloading at the same speed
[19:13] yup, I did it
[19:13] I simply left STOP there for a while :P
[19:14] ah i see
[19:14] thanks
[19:14] Schbirid: you're welcome
[19:16] donbex, does your script restart processes when they die of "natural death"?
[19:16] for instance, many just died because they couldn't tell the tracker they completed a user
[19:20] Nemo_bis: nope; right now it moves out the logfiles belonging to dead processes, though
[19:23] it will not restart them, as I'm using this script after I stop the clients, too
[19:24] yep, you just add some when they're not enough
[19:24] what's the best number for you?
[19:25] right now I'm running 50 of them, with ionice -n 6
[19:26] and I'm sure that 200 are definitely too many
[19:27] The mypodcast download is going well.
[19:27] They deleted a lot
[19:27] or aren't getting servers up
[19:27] BUT
[19:28] the ones that are up?
[19:28] 8mb/sec downloads
[19:28] entire podcasts in a second.
[19:28] nice
[19:28] pew pew
[19:29] but SketchCow, for some reason rsync doesn't manage to upload to batcave at more than 300 KiB/s right now, is it my fault or what?
[19:31] SketchCow: which podcast are you talking about?
[19:33] SketchCow: didn't you have a vast hoard of old podcasts? Did that ever get published?
[19:37] closure: I'm slowly backing up diggnation podcast
[19:37] we need to archive dl.tv and crankygeeks.com
[19:37] too big for me
[19:46] Give me ways to download the ones that are too big, I can dupe them.
[19:48] i was thinking you guys could just download all of it
[19:49] i only download the ipod or h264 version
[19:50] donbex, look what I get to see: Error downloading 'it:forzanuovacava'. -> even wget has a conscience
[19:54] Nemo_bis: lol
[19:55] donbex, by the way, we have to remember to export the old SU blog
[19:58] looks like each episode on crankygeeks has a numeric path
[20:01] i figured out that ipod uses http://m.podshow.com/media/19831/ and h.264 uses http://m.podshow.com/media/19439/ path
[20:02] mp3 uses 19004 and mobile 3gp uses 19903
[20:03] windows media format uses 19902
[20:12] chronomex: hahahahahaha
[20:15] SketchCow: Does mypodcast need help?
[20:15] (and are we actually gonna do podbean?)
[20:28] underscor: ;)
[20:28] Is podbean disappearing?
[21:43] badass, anyhub's almost done
[21:49] Yes, it's nearly time to think about how to end this well.
[21:49] There are a few claimed items from days ago that need to be re-added to the queue.
[21:50] It might also be nice to re-do some of the most recent prefixes.
[21:50] And I'm not entirely sure what to do with the items done by PepsiMax.
[21:51] He seems to have lost a few that are marked done, and he may not have fixed some other early items.
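A minimal sketch of the client setup donbex describes around 19:16-19:26 above: a batch of download clients started at low I/O priority so rsync and the rest of the machine stay responsive. The dld-client name and the ionice -n 6 priority come from the log; the nickname argument and exact invocation are assumptions, so check the wiki page for the real usage:

    # Hypothetical launcher: start N splinder clients at best-effort I/O priority 6.
    NICK=donbex      # tracker nickname (placeholder)
    N=50             # donbex's working count; 200 was "definitely too many"
    for i in $(seq 1 "$N"); do
      ionice -c 2 -n 6 ./dld-client.sh "$NICK" > "client-$i.log" 2>&1 &
    done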
[21:53] i got a script that may be good for getting crankygeeks
[21:53] http://pastebin.com/nnPEsWqB
[21:54] it doesn't download anything but it does give the right url path for at least up to 142 episodes
[21:54] 133 was skipped because it's a repeat of 071
[21:55] 3gp version on old url did happen until 112
[22:28] alard: I guess just enqueue them back up again
[22:28] duplicates are better than nothing or corrupted, I guess
[22:29] ^
[22:30] ha
[22:31] speaking of anyhub, I just checked my EC2 monitoring stats for the instance I'm using as an anyhub grabber, and the CPU Utilization graph has been a constant 100%
[22:31] over the last six hours
[22:32] average disk writes is dead at 0.0 over the same time period, though
[22:32] that can't be right
[22:35] wow, 1/3rd done on splinder
[22:35] alard: what would rock is if there were a way to verify what has been uploaded and requeue the rest
[22:37] closure: That depends on where it's uploaded. (Is this anyhub or splinder, by the way?)
[22:39] well, I mean in general
[22:42] Verifying is possible, but it would mean that SketchCow would have to run scripts on batcave. (Verifying warcs and wget logs.)
[22:43] hmm, anyhub.com/4z-t is a 996 MB file
[22:43] damn
[22:43] I'm guessing porn
[22:55] I had a 2gb file yesterday :/
[23:00] RIP http://anyhub.com/
[23:01] alard: Well, i'm sending my stuff
[23:02] yipdw: it's too late.
[23:02] PepsiMax: Yes, but have you run ./fix-dld.sh on all of it? And checked for your disk full errors ( grep -l Cannot data/*/wget*.log )
[23:02] Day changed to 18 Nov 2011
[23:02] anyhub.net, wasn't it?
[23:02] oh
[23:02] imma hero myself
[23:03] omg
[23:03] quick
[23:04] alard: I think it's in good shape.
[23:04] I started moving the data to you and I think it will do well.
[23:07] Final anyhub countdown, 50 to go
[23:08] fuu
[23:09] alard: the last file is 4DlB
[23:10] dekibytes
[23:10] done.
[23:11] wait, wut
[23:11] I guess not
[23:12] Cameron_D: as long as it is online, people will add files to it
[23:12] Ah yes
[23:12] Downloading prefix: 2EB
[23:12] new ones are 4DlB and up
[23:14] Getting next prefix from tracker...
[23:14] No prefix. Sleeping for 30 seconds...
[23:17] It would be cool to keep the clients running for a while, to pick up any leftovers that have been claimed but never marked done.
[23:17] yeah
[23:17] =D
[23:17] Hi
[23:18] hey Marcelo!
[23:18] 122519552 14% 108.41kB/s 1:51:26
[23:18] uhh ADSL!
[23:19] I'll upload now
[23:29] haha
[23:29] "AnyHub
[23:29] users 38514 done, -9 to do"
[23:29] alard: 2A9-1, 3_q-1, 3_W-1, 4xH-1, 4xH-d1 and 4y5-1.
[23:30] alard: 2Al is still incomplete, 2lK is still incomplete, 4d3 is still incomplete, 4hO is still incomplete, 4id is still incomplete, 4vn is still incomplete.
[23:31] Yeah, I have ~10 incomplete ones, re-doing them now
[23:31] dld-single?
[23:31] yeah
[23:31] oh, son of a bitch
[23:32] my download for anyhub.com/4z-a blew up
[23:32] EVERY OTHER PREFIX downloaded fine
[23:32] that's a 2.4 gigabyte WARC that needs to be redone :(
[23:32] ah well
[23:32] not me. :P
[23:32] I'll get on that once I finish fixing 4z6
[23:33] or someone else can do it, 4z6 is gonna be a whilw
[23:33] while
[23:33] I have a feeling some of these are quite large, if someone wants to do some let me know: http://pastebin.com/e0LrTGhA
[23:36] '2lK' is done.
[23:41] '4d3' is done.
[23:43] Downloading prefix: 4hO try: 1
[23:43] 4hO errored
[23:43] 5.0M data/4hO
[23:43] and i tell it to dld-single, still.
[23:46] PepsiMax: 4h0 finished without problems for me.
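The pre-upload check alard asks PepsiMax about at 23:02 above, spelled out. The grep is quoted verbatim from the log; the fix-dld.sh argument is an assumption (the other scripts take a tracker nickname), so the wiki page is the authority on the real invocation:

    # Find wget logs that recorded "Cannot write" style errors (typically a full disk);
    # any item listed here needs to be re-done rather than uploaded.
    grep -l Cannot data/*/wget*.log
    # Verify and repair the downloaded items before sending them to batcave.
    ./fix-dld.sh yournick      # nickname argument assumed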
[23:49] OK, those ones on that list I posted are done/in progress so you can ignore them now
[23:52] alard: 4vn seems to be stuck
[23:52] same for 4id
[23:55] anyhub.net-4vn_-20111114-1.warc.gz 377MB
[23:55] anyhub.net-4vn_-20111118-1.warc.gz 6MB
[23:58] alard: I think he meant four-h-capital-O
[23:59] That was a 0 here
[23:59] Ah, I see.
[23:59] 0O
[23:59] yeah, it was zero
[23:59] requesting '3_W'
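A hedged sketch of re-doing the prefixes that were still stuck at the end of the log (4vn and 4id above, plus anything left from alard's 23:29-23:30 list). It assumes the anyhub dld-single.sh takes a nickname and a prefix the same way the splinder one takes a nickname and a user id; the actual argument order isn't shown in this log:

    # Hypothetical loop over whichever anyhub prefixes are still outstanding.
    NICK=yournick
    for prefix in 4vn 4id; do
      ./dld-single.sh "$NICK" "$prefix"
    done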