[00:01] will get to work in a bit
[00:01] :P
[00:21] Yes
[01:34] winr4r: starting on the scraper now
[01:34] let's see how long it takes to write it :P
[01:36] :D
[01:40] About to pump magazines into http://archive.org/details/byte-magazine
[01:43] woohoo!
[01:53] root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# du -sh .
[01:53] 33G .
[01:53] root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# ls | wc -l
[01:53] 128
[01:56] slurp
[01:58] Here's what I plan to do.
[01:58] OK, then, 1986_03_BYTE_11-03_Homebound_Computing.pdf gets the love.
[01:58] I will add an item called byte-magazine-1986-03.
[01:58] I will say this dates to 1986-03.
[01:58] In the collection named byte-magazine...
[01:58] I will give it the title of Byte Magazine Volume 11 Number 03 - Homebound Computing.
[01:58] And here we go.
[01:59] for each in *.pdf
[01:59] > do
[01:59] > sh ingestor "$each"
[01:59] > done
[01:59] And ingestor does ALL the work.
[02:00] It's finished uploading 8 already
[02:06] 30 uploaded.
[02:06] So not so bad.
[02:06] They'll start slowing down - some issues are 280-300mb
[02:16] [...]
[02:16] Archived 'Another night like this...', posted at 2005-02-06T15:42:00 by Devil's Kitchen
[02:16] Archived 'Joe Gordon', posted at 2005-01-13T21:45:00 by Devil's Kitchen
[02:16] Archived 'Toll Free...', posted at 2005-02-22T21:16:00 by Devil's Kitchen
[02:16] Archived 'Well, hello...', posted at 2005-01-13T21:26:00 by Devil's Kitchen
[02:16] Scraping http://www.devilskitchen.me.uk/2005_02_01_archive.html...
[02:16] that seems to go pretty well
[02:16] now to actually save it
[02:20] woohoo!
[02:34] http://archive.org/details/byte-magazine-1985-01
[02:34] 319mb!!
[02:34] winr4r: scraping now
[02:34] at most
[02:34] shouldn't take much more than a minute or 2
[02:34] huzzah
[02:36] Are you using something that blows it into .warc as well?
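[Editor's note: the ingestor script itself is never shown in the log. A minimal sketch of the filename-to-metadata step it would have to perform, assuming the naming convention generalizes from the single example given (`1986_03_BYTE_11-03_Homebound_Computing.pdf` → item `byte-magazine-1986-03`); the function name is hypothetical.]

```python
import re

def plan_item(filename):
    """Derive archive.org item metadata from a BYTE scan filename.

    Assumes names like 1986_03_BYTE_11-03_Homebound_Computing.pdf:
    year, month, the literal 'BYTE', volume-number, then the issue title.
    """
    m = re.match(r'(\d{4})_(\d{2})_BYTE_(\d+)-(\d+)_(.+)\.pdf$', filename)
    if m is None:
        raise ValueError('unrecognized filename: %s' % filename)
    year, month, volume, number, title = m.groups()
    return {
        'identifier': 'byte-magazine-%s-%s' % (year, month),
        'date': '%s-%s' % (year, month),
        'collection': 'byte-magazine',
        'title': 'Byte Magazine Volume %s Number %s - %s'
                 % (volume, number, title.replace('_', ' ')),
    }
```

Driving this from the `for each in *.pdf` loop quoted above would reproduce the item naming described at 01:58.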
[02:36] lol, I was 403'd
[02:37] SketchCow: no, I'm actually parsing the archives pages
[02:37] archive *
[02:41] okay, let's try it again from another IP with a bit more delay in between >.>
[02:42] this will take a while :P
[02:42] SketchCow: output is JSON with post title, author name, posting date, and body
[02:42] body being the HTML of the particular post
[02:43] root@aarnist:~/devilskitchen# find -type f | wc -l
[02:43] so far
[02:43] 140
[02:45] 365...
[02:45] 388...
[02:46] I've arrived at 2006 by now :P
[02:51] if anyone cares, scraper source: http://git.cryto.net/cgit/joepie91/tree/tools/scrapers/devilskitchen.py
[02:51] cc winr4r
[02:51] 786 posts archived so far, around 2007-10 now
[02:52] this is archivey enough for #archiveteam
[02:52] imo
[02:52] mm... fair enough
[02:52] will move the convo there then :)
[02:52] k
[02:52] k
[02:52] k
[02:52] WHOAH
[02:52] sorry, trying to write on my netbook in the dark :/
[02:52] winr4r: you're not in #archiveteam
[02:53] and lol
[02:53] joepie91: i'm not
[02:54] also, going to sleep with a cat tucked in behind my knees
[02:54] haha
[02:55] winr4r: you don't want to see the result then? :P
[02:55] there is literally nothing in the world that feels better than this
[02:55] only 4 more years worth of posts to go
[02:55] heh
[02:55] winr4r: sex is nice too.
[02:55] joepie91: i really do, but as for me, and right now, there is me and my neighbour's cat
[02:55] we're going to both sleep very well
[02:55] :P
[02:55] gnight folks
[02:55] night
[02:56] goddamnit.
[02:56] 403'd again.
[02:57] annoying.
[03:19] SketchCow: suggestions for places to upload the resulting scrape?
[03:19] fit for archive.org, for example?
[03:24] for reference, here is the full scrape (minus the pages that 403ed for some reason): http://aarnist.cryto.net:81/devilskitchen.tar.gz cc winr4r
[03:25] joepie91: where's the .warc?
[03:25] chronomex: there is none
[03:25] why not?
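[Editor's note: the real scraper is linked above (devilskitchen.py). As an illustration of the output format described at 02:42 — JSON with post title, author name, posting date, and body HTML — here is a sketch of one record; the sample values are taken from the `Archived '...'` log lines, the body snippet and helper name are hypothetical.]

```python
import json

# One scraped post, shaped as the log describes: four fields,
# with 'body' holding the post's raw HTML rather than plain text.
post = {
    'title': 'Another night like this...',
    'author': "Devil's Kitchen",
    'date': '2005-02-06T15:42:00',
    'body': '<p>...post HTML goes here...</p>',  # hypothetical placeholder
}

def dump_post(record):
    """Serialize one scraped post to a JSON string with stable key order."""
    return json.dumps(record, sort_keys=True)
```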
[03:25] because I scraped the actual blog posts, and not the site as a whole
[03:26] just the content, not even the html?
[03:26] chronomex: as mentioned earlier, it has the title, author, date, and body of every blog post
[03:26] :P
[03:26] if you really want a .warc, feel free to run wget-warc, because I don't have it here
[03:26] ah, ok
[03:26] it's a pretty small site anyway
[03:27] have a list of urls I can work from?
[03:27] saving the archive pages suffices, because it doesn't shorten the articles
[03:27] sure, 1 sec
[03:27] archive pages don't get comments :)
[03:28] http://pastie.org/4778385
[03:28] there you go
[03:28] correct
[03:28] but considering it's google, doing anything more is a bit tricky
[03:28] :/
[03:28] google is incredibly hostile towards scrapers and bots in my experience
[03:28] :(
[03:28] it 403d my home IP for a short while (entirely, not just for a few pages)
[03:28] after I scraped with a 5 second interval
[03:29] User-agent: *
[03:29] Disallow: /search
[03:29] Allow: /
[03:29] LIES
[03:29] hmm? :P
[03:29] in /robots.txt
[03:30] that doesn't make it not hostile towards bots/scrapers :)
[03:30] not relevant: http://www.reddit.com/r/obots
[03:30] hahahaha http://www.reddit.com/robots.txt
[03:31] User-Agent: bender
[03:31] Disallow: /my_shiny_metal_ass
[03:31] User-Agent: Gort
[03:31] Disallow: /earth
[03:31] lol
[03:33] joepie91: Get it all together and it has a home in the archiveteam collection at archive.org.
[03:34] SketchCow: right, I have a JSON dump of all the articles packed up here: http://aarnist.cryto.net:81/devilskitchen.tar.gz
[03:34] is that sufficient?
[03:34] title, author, date, body
[03:35] How many articles
[03:35] 1114
[03:52] OK, so.
[03:52] you have a copy
[03:52] you really want a warc copy as well.
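[Editor's note: the robots.txt rules pasted at 03:29 can be checked mechanically with the standard library's `urllib.robotparser`; a minimal sketch, using the exact rules quoted above and an assumed bot name.]

```python
from urllib import robotparser

# The robots.txt excerpt pasted in the log, verbatim.
rules = """\
User-agent: *
Disallow: /search
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# /search is off-limits to every agent; ordinary pages are allowed.
print(parser.can_fetch('mybot', 'http://example.com/search?q=x'))          # False
print(parser.can_fetch('mybot', 'http://example.com/2005_02_01_archive.html'))  # True
```

As the log notes, an `Allow: /` on paper says nothing about rate-limit 403s in practice.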
[03:52] You want a couple good copies, so we have something to work with in the future
[03:52] WARC is what archive.org wants, although it's clunky in contemporary space for now
[04:02] SketchCow:
[04:02] cat: css.c: No such file or directory
[04:02] make[3]: *** [css_.c] Error 1
[04:02] make[3]: Leaving directory `/root/wget-warc/trunk/src'
[04:02] when compiling wget-warc
[04:02] any suggestions?
[04:04] debian 6 btw
[04:07] ah, problem solved it seems
[04:07] apt-get install flex && ./configure && make
[04:10] help ._.
[04:10] make[2]: *** No rule to make target `Makevars', needed by `Makefile'. Stop.
[04:13] right, I think it works now
[04:22] finally found a command that does the job
[04:22] lol
[04:24] SketchCow: okay, wget-warc'ing the blog now, let's see if I get through without google banning me
[04:24] it ran against a no-index, so I had to ignore it
[04:24] er
[04:24] no-follow *
[05:51] SketchCow: going to a non-archived URL via wayback machine adds it to archive queue? <-- technically it doesn't add it to a queue, it just does a grab of the page right then
[05:51] + any prerequisites that your browser fetches
[07:33] i may do a better pull of hackaday.com
[07:33] mostly cause the images are not in warc.gz format
[09:13] joepie91: The most recent Wget release (1.14) has warc support built-in. It looks like you've compiled an older version (one with a "trunk" directory), so it might be useful to upgrade if you plan to use it again.
[10:14] joepie91: you're wonderful
[10:14] good job
[13:30] Uploading a few hundred Laptop manuals
[13:33] good morning jason!
[13:33] and hello mistym
[13:33] Morning!
[13:34] Ugggh, why did it have to get so cold so fast? I mean it is Winnipeg, but...
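[Editor's note: in practice wget 1.14's built-in WARC support (the `--warc-file` option mentioned at 09:13) writes these files for you. Purely to show what "WARC is what archive.org wants" means at the byte level, here is a sketch of a single WARC 1.0 response record built by hand; the helper name is hypothetical and this is not wget's code.]

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(url, http_bytes):
    """Build one WARC 1.0 'response' record: a header block, a blank
    line, the captured HTTP message, and a two-CRLF record separator."""
    headers = [
        'WARC/1.0',
        'WARC-Type: response',
        'WARC-Record-ID: <urn:uuid:%s>' % uuid.uuid4(),
        'WARC-Date: %s' % datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
        'WARC-Target-URI: %s' % url,
        'Content-Type: application/http; msgtype=response',
        'Content-Length: %d' % len(http_bytes),  # length of the HTTP block only
    ]
    return '\r\n'.join(headers).encode() + b'\r\n\r\n' + http_bytes + b'\r\n\r\n'
```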
:/
[13:35] it got much colder in the last couple of days here, too
[13:40] SketchCow, winr4r, tar.gz with both a warc and a json dump of the blog in it: http://aarnist.cryto.net:81/devilskitchen_final.tar.gz
[13:40] warc seems to have completed successfully
[13:41] (surprisingly)
[13:45] joepie91: good job :)
[13:53] http://archive.org/details/archiveteam-devilskitchen-panic
[14:03] yay!
[14:08] \o/
[16:43] SketchCow: just for you to know i'm getting ~40000 external images from my underground-gamer.com dump
[16:43] also i think there is enough stuff in this dump just to do a talk on pirates again
[19:24] would you look at that, WHOIS data in JSON format :)
[19:24] http://whois.cryto.net/ :D
[20:46] on the subject of manual uploads, might as well toot my own horn http://archive.org/search.php?query=subject%3A%22computer%20history%22%20AND%20uploader%3A%22dopefishjustin%40gmail.com%22%20AND%20collection%3Aopensource&sort=-publicdate
[20:57] looks nice
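[Editor's note: the final deliverable at 13:40 is a tar.gz holding both the WARC and the JSON dump. A minimal sketch of that packaging step with the standard library's `tarfile`; the function name and member names are hypothetical, and it writes to any binary file object.]

```python
import io
import tarfile

def pack_scrape(fileobj, files):
    """Bundle scrape artifacts (e.g. a .warc.gz and a .json dump) into
    one gzip-compressed tar stream, the shape of devilskitchen_final.tar.gz.
    `files` maps archive member names to their raw bytes."""
    with tarfile.open(fileobj=fileobj, mode='w:gz') as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)          # tar needs the size up front
            tar.addfile(info, io.BytesIO(data))
```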