#archiveteam-bs 2012-09-22,Sat


Time Nickname Message
00:01 🔗 joepie91 will get to work in a bit
00:01 🔗 joepie91 :P
00:21 🔗 SketchCow Yes
01:34 🔗 joepie91 winr4r: starting on the scraper now
01:34 🔗 joepie91 let's see how long it takes to write it :P
01:36 🔗 winr4r :D
01:40 🔗 SketchCow About to pump magazines into http://archive.org/details/byte-magazine
01:43 🔗 winr4r woohoo!
01:53 🔗 SketchCow root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# du -sh .
01:53 🔗 SketchCow 33G .
01:53 🔗 SketchCow root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# ls | wc -l
01:53 🔗 SketchCow 128
01:56 🔗 winr4r slurp
01:58 🔗 SketchCow Here's what I plan to do.
01:58 🔗 SketchCow OK, then, 1986_03_BYTE_11-03_Homebound_Computing.pdf gets the love.
01:58 🔗 SketchCow I will add an item called byte-magazine-1986-03.
01:58 🔗 SketchCow I will say this dates to 1986-03.
01:58 🔗 SketchCow In the collection named byte-magazine...
01:58 🔗 SketchCow I will give it the title of Byte Magazine Volume 11 Number 03 - Homebound Computing.
01:58 🔗 SketchCow And here we go.
01:59 🔗 SketchCow for each in *.pdf
01:59 🔗 SketchCow > do
01:59 🔗 SketchCow > sh ingestor "$each"
01:59 🔗 SketchCow > done
01:59 🔗 SketchCow And ingestor does ALL the work.
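[Sketch, not part of the log: the "ingestor" script itself is never shown. The mapping SketchCow describes, from a filename like 1986_03_BYTE_11-03_Homebound_Computing.pdf to an item name, date, collection and title, could look roughly like this in Python. The function name, the regex and the dictionary keys are assumptions made for illustration only.]

```python
import re

# Assumed filename layout, inferred from the single example in the log:
#   YYYY_MM_BYTE_VV-NN_Theme_Words.pdf
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<month>\d{2})_BYTE_"
    r"(?P<vol>\d{2})-(?P<num>\d{2})_(?P<theme>.+)\.pdf$"
)


def byte_item_metadata(filename):
    """Derive item metadata from a BYTE scan filename (hypothetical helper)."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError("unexpected filename layout: " + filename)
    date = m["year"] + "-" + m["month"]
    theme = m["theme"].replace("_", " ")
    return {
        "identifier": "byte-magazine-" + date,
        "collection": "byte-magazine",
        "date": date,
        "title": "Byte Magazine Volume {} Number {} - {}".format(
            m["vol"], m["num"], theme
        ),
    }


if __name__ == "__main__":
    print(byte_item_metadata("1986_03_BYTE_11-03_Homebound_Computing.pdf"))
```

[The actual upload step is left out on purpose, since the log does not say how ingestor talks to archive.org.]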
02:00 🔗 SketchCow It's finished uploading 8 already
02:06 🔗 SketchCow 30 uploaded.
02:06 🔗 SketchCow So not so bad.
02:06 🔗 SketchCow They'll start slowing down - some issues are 280-300mb
02:16 🔗 joepie91 [...]
02:16 🔗 joepie91 Archived 'Another night like this...', posted at 2005-02-06T15:42:00 by Devil's Kitchen
02:16 🔗 joepie91 Archived 'Joe Gordon', posted at 2005-01-13T21:45:00 by Devil's Kitchen
02:16 🔗 joepie91 Archived 'Toll Free...', posted at 2005-02-22T21:16:00 by Devil's Kitchen
02:16 🔗 joepie91 Archived 'Well, hello...', posted at 2005-01-13T21:26:00 by Devil's Kitchen
02:16 🔗 joepie91 Scraping http://www.devilskitchen.me.uk/2005_02_01_archive.html...
02:16 🔗 joepie91 that seems to go pretty well
02:16 🔗 joepie91 now to actually save it
02:20 🔗 winr4r woohoo!
02:34 🔗 SketchCow http://archive.org/details/byte-magazine-1985-01
02:34 🔗 SketchCow 319mb!!
02:34 🔗 joepie91 winr4r: scraping now
02:34 🔗 joepie91 shouldn't take much more than a minute or 2
02:34 🔗 joepie91 at most
02:34 🔗 winr4r huzzah
02:36 🔗 SketchCow Are you using something that blows it into .warc as well?
02:36 🔗 joepie91 lol, I was 403'd
02:37 🔗 joepie91 SketchCow: no, I'm actually parsing the archives pages
02:37 🔗 joepie91 archive *
02:41 🔗 joepie91 okay, let's try it again from another IP with a bit more delay inbetween >.>
02:42 🔗 joepie91 this will take a while :P
02:42 🔗 joepie91 SketchCow: output is JSON with post title, author name, posting date, and body
02:42 🔗 joepie91 body being the HTML of the particular post
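[Sketch, not part of the log: one record of the JSON output joepie91 describes might be shaped like this. The key names and the helper function are assumptions; the real layout is whatever the linked devilskitchen.py writes. The title, author and date come from the scraper output quoted earlier in the log, and the body is a placeholder.]

```python
import json


def post_record(title, author, posted_at, body_html):
    """Bundle one blog post into a JSON-serialisable dict (assumed key names)."""
    return {
        "title": title,
        "author": author,
        "date": posted_at,   # ISO-style timestamp, as printed by the scraper
        "body": body_html,   # raw HTML of the post, kept unmodified
    }


record = post_record(
    "Joe Gordon",
    "Devil's Kitchen",
    "2005-01-13T21:45:00",
    "<p>(post HTML would go here)</p>",  # placeholder, not real post content
)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```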
02:43 🔗 joepie91 root@aarnist:~/devilskitchen# find -type f | wc -l
02:43 🔗 joepie91 140
02:43 🔗 joepie91 so far
02:45 🔗 joepie91 365...
02:45 🔗 joepie91 388...
02:46 🔗 joepie91 I've arrived at 2006 by now :P
02:51 🔗 joepie91 if anyone cares, scraper source: http://git.cryto.net/cgit/joepie91/tree/tools/scrapers/devilskitchen.py
02:51 🔗 joepie91 cc winr4r
02:51 🔗 joepie91 786 posts archived so far, around 2007-10 now
02:52 🔗 chronomex this is archivey enough for #archiveteam
02:52 🔗 chronomex imo
02:52 🔗 joepie91 mm... fair enough
02:52 🔗 joepie91 will move the convo there then :)
02:52 🔗 winr4r k
02:52 🔗 winr4r k
02:52 🔗 winr4r k
02:52 🔗 winr4r WHOAH
02:52 🔗 winr4r sorry, trying to write on my netbook in the dark :/
02:52 🔗 joepie91 winr4r: you're not in #archiveteam
02:53 🔗 joepie91 and lol
02:53 🔗 winr4r joepie91: i'm not
02:54 🔗 winr4r also, going to sleep with a cat tucked in behind my knees
02:54 🔗 joepie91 haha
02:55 🔗 joepie91 winr4r: you don't want to see the result then? :P
02:55 🔗 winr4r there is literally nothing in the world that feels better than this
02:55 🔗 joepie91 only 4 more years worth of posts to go
02:55 🔗 joepie91 heh
02:55 🔗 chronomex winr4r: sex is nice too.
02:55 🔗 winr4r joepie91: i really do, but as for me, and right now, there is me and my neighbour's cat
02:55 🔗 winr4r we're going to both sleep very well
02:55 🔗 joepie91 :P
02:55 🔗 winr4r gnight folks
02:55 🔗 joepie91 night
02:56 🔗 joepie91 goddamnit.
02:56 🔗 joepie91 403'd again.
02:57 🔗 joepie91 annoying.
03:19 🔗 joepie91 SketchCow: suggestions for places to upload the resulting scrape?
03:19 🔗 joepie91 fit for archive.org, for example?
03:24 🔗 joepie91 for reference, here is the full scrape (minus the pages that 403ed for some reason): http://aarnist.cryto.net:81/devilskitchen.tar.gz cc winr4r
03:25 🔗 chronomex joepie91: where's the .warc?
03:25 🔗 joepie91 chronomex: there is none
03:25 🔗 chronomex why not?
03:25 🔗 joepie91 because I scraped the actual blog posts, and not the site as a whole
03:26 🔗 chronomex just the content, not even the html?
03:26 🔗 joepie91 chronomex: as mentioned earlier, it has the title, author, date, and body of every blog post
03:26 🔗 joepie91 :P
03:26 🔗 joepie91 if you really want a .warc, feel free to run wget-warc, because I don't have it here
03:26 🔗 chronomex ah, ok
03:26 🔗 joepie91 it's a pretty small site anyway
03:27 🔗 chronomex have a list of urls I can work from?
03:27 🔗 joepie91 saving the archive pages suffices, because it doesn't shorten the articles
03:27 🔗 joepie91 sure, 1 sec
03:27 🔗 chronomex archive pages don't get comments :)
03:28 🔗 joepie91 http://pastie.org/4778385
03:28 🔗 joepie91 there you go
03:28 🔗 joepie91 correct
03:28 🔗 joepie91 but considering it's google, doing anything more is a bit tricky
03:28 🔗 joepie91 :/
03:28 🔗 joepie91 google is incredibly hostile towards scrapers and bots in my experience
03:28 🔗 chronomex :(
03:28 🔗 joepie91 it 403d my home IP for a short while (entirely, not just for a few pages)
03:28 🔗 joepie91 after I scraped with a 5 second interval
03:29 🔗 chronomex User-agent: *
03:29 🔗 chronomex Disallow: /search
03:29 🔗 chronomex Allow: /
03:29 🔗 chronomex LIES
03:29 🔗 joepie91 hmm? :P
03:29 🔗 chronomex in /robots.txt
03:30 🔗 joepie91 that doesn't make it not hostile towards bots/scrapers :)
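[Sketch, not part of the log: the three robots.txt rules chronomex quotes can be checked with Python's standard urllib.robotparser. This only shows what the file claims to permit; as joepie91 says, it tells you nothing about whether the server will still 403 a crawler. The rule order used here (User-agent, Disallow, Allow) is the conventional one and is an assumption about the real file.]

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the channel, in conventional order.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /",
]


def build_parser(lines):
    """Parse robots.txt lines held in memory instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


parser = build_parser(ROBOTS_LINES)

if __name__ == "__main__":
    for url in (
        "http://www.devilskitchen.me.uk/2005_02_01_archive.html",
        "http://www.devilskitchen.me.uk/search",
    ):
        print(url, parser.can_fetch("*", url))
```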
03:30 🔗 chronomex not relevant: http://www.reddit.com/r/obots
03:30 🔗 chronomex hahahaha http://www.reddit.com/robots.txt
03:31 🔗 chronomex User-Agent: bender
03:31 🔗 chronomex Disallow: /my_shiny_metal_ass
03:31 🔗 chronomex User-Agent: Gort
03:31 🔗 chronomex Disallow: /earth
03:31 🔗 joepie91 lol
03:33 🔗 SketchCow joepie91: Get it all together and it has a home in the archiveteam collection at archive.org.
03:34 🔗 joepie91 SketchCow: right, I have a JSON dump of all the articles packed up here: http://aarnist.cryto.net:81/devilskitchen.tar.gz
03:34 🔗 joepie91 is that sufficient?
03:34 🔗 joepie91 title, author, date, body
03:35 🔗 SketchCow How many articles
03:35 🔗 joepie91 1114
03:52 🔗 SketchCow OK, so.
03:52 🔗 SketchCow you have a copy
03:52 🔗 SketchCow you really want a warc copy as well.
03:52 🔗 SketchCow You want a couple good copies, so we have something to work with in the future
03:52 🔗 SketchCow WARC is what archive.org wants, although it's clunky in contemporary space for now
04:02 🔗 joepie91 SketchCow:
04:02 🔗 joepie91 cat: css.c: No such file or directory
04:02 🔗 joepie91 make[3]: *** [css_.c] Error 1
04:02 🔗 joepie91 make[3]: Leaving directory `/root/wget-warc/trunk/src'
04:02 🔗 joepie91 when compiling wget-warc
04:02 🔗 joepie91 any suggestions?
04:04 🔗 joepie91 debian 6 btw
04:07 🔗 joepie91 ah, problem solved it seems
04:07 🔗 joepie91 apt-get install flex && ./configure && make
04:10 🔗 joepie91 help ._.
04:10 🔗 joepie91 make[2]: *** No rule to make target `Makevars', needed by `Makefile'. Stop.
04:13 🔗 joepie91 right, I think it works now
04:22 🔗 joepie91 finally found a command that does the job
04:22 🔗 joepie91 lol
04:24 🔗 joepie91 SketchCow: okay, wget-warc'ing the blog now, let's see if I get through without google banning me
04:24 🔗 joepie91 it ran against a no-index, so I had to ignore it
04:24 🔗 joepie91 er
04:24 🔗 joepie91 no-follow *
05:51 🔗 DFJustin <joepie91> SketchCow: going to a non-archived URL via wayback machine adds it to archive queue? <-- technically it doesn't add it to a queue, it just does a grab of the page right then
05:51 🔗 DFJustin + any prerequisites that your browser fetches
07:33 🔗 godane i may do a better pull of hackaday.com
07:33 🔗 godane mostly cause the images are not in warc.gz format
09:13 🔗 alard joepie91: The most recent Wget release (1.14) has warc support built-in. It looks like you've compiled an older version (one with a "trunk" directory), so it might be useful to upgrade if you plan to use it again.
10:14 🔗 winr4r joepie91: you're wonderful
10:14 🔗 winr4r good job
13:30 🔗 SketchCow Uploading a few hundred Laptop manuals
13:33 🔗 winr4r good morning jason!
13:33 🔗 winr4r and hello mistym
13:33 🔗 mistym Morning!
13:34 🔗 mistym Ugggh, why did it have to get so cold so fast? I mean it is Winnipeg, but... :/
13:35 🔗 winr4r it got much colder in the last couple of days here, too
13:40 🔗 joepie91 SketchCow, winr4r, tar.gz with both a warc and a json dump of the blog in it: http://aarnist.cryto.net:81/devilskitchen_final.tar.gz
13:40 🔗 joepie91 warc seems to have completed successfully
13:41 🔗 joepie91 (surprisingly)
13:45 🔗 winr4r joepie91: good job :)
13:53 🔗 SketchCow http://archive.org/details/archiveteam-devilskitchen-panic
14:03 🔗 winr4r yay!
14:08 🔗 joepie91 \o/
16:43 🔗 godane SketchCow: just for you to know i'm getting ~40000 external images from my underground-gamer.com dump
16:43 🔗 godane also i think there is enough stuff in this dump just to do a talk on pirates again
19:24 🔗 joepie91 would you look at that, WHOIS data in JSON format :)
19:24 🔗 joepie91 http://whois.cryto.net/ :D
20:46 🔗 DFJustin on the subject of manual uploads, might as well toot my own horn http://archive.org/search.php?query=subject%3A%22computer%20history%22%20AND%20uploader%3A%22dopefishjustin%40gmail.com%22%20AND%20collection%3Aopensource&sort=-publicdate
20:57 🔗 dashcloud looks nice
