#archiveteam-bs 2012-09-22,Sat


Time	Nickname	Message
00:01 ^🔗	joepie91	will get to work in a bit
00:01 ^🔗	joepie91	:P
00:21 ^🔗	SketchCow	Yes
01:34 ^🔗	joepie91	winr4r: starting on the scraper now
01:34 ^🔗	joepie91	let's see how long it takes to write it :P
01:36 ^🔗	winr4r	:D
01:40 ^🔗	SketchCow	About to pump magazines into http://archive.org/details/byte-magazine
01:43 ^🔗	winr4r	woohoo!
01:53 ^🔗	SketchCow	root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# du -sh .
01:53 ^🔗	SketchCow	33G .
01:53 ^🔗	SketchCow	root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# ls | wc -l
01:53 ^🔗	SketchCow	128
01:56 ^🔗	winr4r	slurp
01:58 ^🔗	SketchCow	Here's what I plan to do.
01:58 ^🔗	SketchCow	OK, then, 1986_03_BYTE_11-03_Homebound_Computing.pdf gets the love.
01:58 ^🔗	SketchCow	I will add an item called byte-magazine-1986-03.
01:58 ^🔗	SketchCow	I will say this dates to 1986-03.
01:58 ^🔗	SketchCow	In the collection named byte-magazine...
01:58 ^🔗	SketchCow	I will give it the title of Byte Magazine Volume 11 Number 03 - Homebound Computing.
01:58 ^🔗	SketchCow	And here we go.
01:59 ^🔗	SketchCow	for each in *.pdf
01:59 ^🔗	SketchCow	> do
01:59 ^🔗	SketchCow	> sh ingestor "$each"
01:59 ^🔗	SketchCow	> done
01:59 ^🔗	SketchCow	And ingestor does ALL the work.
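The naming plan SketchCow describes above (filename to item identifier, date, collection, and title) can be sketched as follows. This is only an illustration: the real `ingestor` script is not shown in the log, and the function name `metadata_from_filename`, the regular expression, and the dictionary keys are all assumptions.

```python
import re

# Hypothetical pattern for names like
# 1986_03_BYTE_11-03_Homebound_Computing.pdf (year, month, volume-number, theme).
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<month>\d{2})_BYTE_"
    r"(?P<vol>\d{2})-(?P<num>\d{2})_(?P<theme>.+)\.pdf$"
)


def metadata_from_filename(filename):
    """Derive the item metadata described in the chat from a scan's filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised filename: {filename!r}")
    date = f"{match['year']}-{match['month']}"
    theme = match["theme"].replace("_", " ")
    return {
        "identifier": f"byte-magazine-{date}",
        "date": date,
        "collection": "byte-magazine",
        "title": (
            f"Byte Magazine Volume {match['vol']} "
            f"Number {match['num']} - {theme}"
        ),
    }


if __name__ == "__main__":
    print(metadata_from_filename("1986_03_BYTE_11-03_Homebound_Computing.pdf"))
```

Filenames that do not fit the assumed pattern raise `ValueError` rather than producing a half-filled record, so a batch run would stop on an oddly named scan instead of uploading it under a wrong identifier.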
02:00 ^🔗	SketchCow	It's finished uploading 8 already
02:06 ^🔗	SketchCow	30 uploaded.
02:06 ^🔗	SketchCow	So not so bad.
02:06 ^🔗	SketchCow	They'll start slowing down - some issues are 280-300mb
02:16 ^🔗	joepie91	[...]
02:16 ^🔗	joepie91	Archived 'Another night like this...', posted at 2005-02-06T15:42:00 by Devil's Kitchen
02:16 ^🔗	joepie91	Archived 'Joe Gordon', posted at 2005-01-13T21:45:00 by Devil's Kitchen
02:16 ^🔗	joepie91	Archived 'Toll Free...', posted at 2005-02-22T21:16:00 by Devil's Kitchen
02:16 ^🔗	joepie91	Archived 'Well, hello...', posted at 2005-01-13T21:26:00 by Devil's Kitchen
02:16 ^🔗	joepie91	Scraping http://www.devilskitchen.me.uk/2005_02_01_archive.html...
02:16 ^🔗	joepie91	that seems to go pretty well
02:16 ^🔗	joepie91	now to actually save it
02:20 ^🔗	winr4r	woohoo!
02:34 ^🔗	SketchCow	http://archive.org/details/byte-magazine-1985-01
02:34 ^🔗	SketchCow	319mb!!
02:34 ^🔗	joepie91	winr4r: scraping now
02:34 ^🔗	joepie91	shouldn't take much more than a minute or 2
02:34 ^🔗	joepie91	at most
02:34 ^🔗	winr4r	huzzah
02:36 ^🔗	SketchCow	Are you using something that blows it into .warc as well?
02:36 ^🔗	joepie91	lol, I was 403'd
02:37 ^🔗	joepie91	SketchCow: no, I'm actually parsing the archives pages
02:37 ^🔗	joepie91	archive *
02:41 ^🔗	joepie91	okay, let's try it again from another IP with a bit more delay inbetween >.>
02:42 ^🔗	joepie91	this will take a while :P
02:42 ^🔗	joepie91	SketchCow: output is JSON with post title, author name, posting date, and body
02:42 ^🔗	joepie91	body being the HTML of the particular post
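A record with the four fields joepie91 lists (title, author, posting date, body HTML) might look roughly like this. The key names, the helper functions `post_record` and `dump_post`, and the placeholder body are assumptions for illustration, not taken from the actual scraper.

```python
import json


def post_record(title, author, posted, body_html):
    """Bundle one blog post into a plain dict (key names are hypothetical)."""
    return {
        "title": title,
        "author": author,
        "date": posted,      # ISO 8601 timestamp string, as printed in the log
        "body": body_html,   # raw HTML of the post, kept unmodified
    }


def dump_post(record):
    """Serialise one record to JSON text, leaving non-ASCII characters readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    example = post_record(
        "Joe Gordon",
        "Devil's Kitchen",
        "2005-01-13T21:45:00",
        "<p>placeholder body</p>",  # placeholder, not the real post content
    )
    print(dump_post(example))
```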
02:43 ^🔗	joepie91	root@aarnist:~/devilskitchen# find -type f | wc -l
02:43 ^🔗	joepie91	140
02:43 ^🔗	joepie91	so far
02:45 ^🔗	joepie91	365...
02:45 ^🔗	joepie91	388...
02:46 ^🔗	joepie91	I've arrived at 2006 by now :P
02:51 ^🔗	joepie91	if anyone cares, scraper source: http://git.cryto.net/cgit/joepie91/tree/tools/scrapers/devilskitchen.py
02:51 ^🔗	joepie91	cc winr4r
02:51 ^🔗	joepie91	786 posts archived so far, around 2007-10 now
02:52 ^🔗	chronomex	this is archivey enough for #archiveteam
02:52 ^🔗	chronomex	imo
02:52 ^🔗	joepie91	mm... fair enough
02:52 ^🔗	joepie91	will move the convo there then :)
02:52 ^🔗	winr4r	k
02:52 ^🔗	winr4r	k
02:52 ^🔗	winr4r	k
02:52 ^🔗	winr4r	WHOAH
02:52 ^🔗	winr4r	sorry, trying to write on my netbook in the dark :/
02:52 ^🔗	joepie91	winr4r: you're not in #archiveteam
02:53 ^🔗	joepie91	and lol
02:53 ^🔗	winr4r	joepie91: i'm not
02:54 ^🔗	winr4r	also, going to sleep with a cat tucked in behind my knees
02:54 ^🔗	joepie91	haha
02:55 ^🔗	joepie91	winr4r: you don't want to see the result then? :P
02:55 ^🔗	winr4r	there is literally nothing in the world that feels better than this
02:55 ^🔗	joepie91	only 4 more years worth of posts to go
02:55 ^🔗	joepie91	heh
02:55 ^🔗	chronomex	winr4r: sex is nice too.
02:55 ^🔗	winr4r	joepie91: i really do, but as for me, and right now, there is me and my neighbour's cat
02:55 ^🔗	winr4r	we're going to both sleep very well
02:55 ^🔗	joepie91	:P
02:55 ^🔗	winr4r	gnight folks
02:55 ^🔗	joepie91	night
02:56 ^🔗	joepie91	goddamnit.
02:56 ^🔗	joepie91	403'd again.
02:57 ^🔗	joepie91	annoying.
03:19 ^🔗	joepie91	SketchCow: suggestions for places to upload the resulting scrape?
03:19 ^🔗	joepie91	fit for archive.org, for example?
03:24 ^🔗	joepie91	for reference, here is the full scrape (minus the pages that 403ed for some reason): http://aarnist.cryto.net:81/devilskitchen.tar.gz cc winr4r
03:25 ^🔗	chronomex	joepie91: where's the .warc?
03:25 ^🔗	joepie91	chronomex: there is none
03:25 ^🔗	chronomex	why not?
03:25 ^🔗	joepie91	because I scraped the actual blog posts, and not the site as a whole
03:26 ^🔗	chronomex	just the content, not even the html?
03:26 ^🔗	joepie91	chronomex: as mentioned earlier, it has the title, author, date, and body of every blog post
03:26 ^🔗	joepie91	:P
03:26 ^🔗	joepie91	if you really want a .warc, feel free to run wget-warc, because I don't have it here
03:26 ^🔗	chronomex	ah, ok
03:26 ^🔗	joepie91	it's a pretty small site anyway
03:27 ^🔗	chronomex	have a list of urls I can work from?
03:27 ^🔗	joepie91	saving the archive pages suffices, because it doesn't shorten the articles
03:27 ^🔗	joepie91	sure, 1 sec
03:27 ^🔗	chronomex	archive pages don't get comments :)
03:28 ^🔗	joepie91	http://pastie.org/4778385
03:28 ^🔗	joepie91	there you go
03:28 ^🔗	joepie91	correct
03:28 ^🔗	joepie91	but considering it's google, doing anything more is a bit tricky
03:28 ^🔗	joepie91	:/
03:28 ^🔗	joepie91	google is incredibly hostile towards scrapers and bots in my experience
03:28 ^🔗	chronomex	:(
03:28 ^🔗	joepie91	it 403d my home IP for a short while (entirely, not just for a few pages)
03:28 ^🔗	joepie91	after I scraped with a 5 second interval
03:29 ^🔗	chronomex	User-agent: *
03:29 ^🔗	chronomex	Disallow: /search
03:29 ^🔗	chronomex	Allow: /
03:29 ^🔗	chronomex	LIES
03:29 ^🔗	joepie91	hmm? :P
03:29 ^🔗	chronomex	in /robots.txt
03:30 ^🔗	joepie91	that doesn't make it not hostile towards bots/scrapers :)
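For what it's worth, the three robots.txt lines chronomex quotes can be checked mechanically with Python's standard `urllib.robotparser`. This is a minimal sketch: the rules are pasted in as a list rather than fetched from the site, and the helper name `allowed` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the chat, supplied inline instead of being downloaded.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /",
]


def allowed(url, user_agent="*"):
    """Return True if the quoted rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A monthly archive page, as scraped earlier in the log.
    print(allowed("http://www.devilskitchen.me.uk/2005_02_01_archive.html"))
    # Anything under /search is excluded by the rules.
    print(allowed("http://www.devilskitchen.me.uk/search"))
```

By these rules the archive pages are permitted, which is the point of the exchange: robots.txt only states policy, and a server can still answer with 403s through rate limiting regardless of what the file says.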
03:30 ^🔗	chronomex	not relevant: http://www.reddit.com/r/obots
03:30 ^🔗	chronomex	hahahaha http://www.reddit.com/robots.txt
03:31 ^🔗	chronomex	User-Agent: bender
03:31 ^🔗	chronomex	Disallow: /my_shiny_metal_ass
03:31 ^🔗	chronomex	User-Agent: Gort
03:31 ^🔗	chronomex	Disallow: /earth
03:31 ^🔗	joepie91	lol
03:33 ^🔗	SketchCow	joepie91: Get it all together and it has a home in the archiveteam collection at archive.org.
03:34 ^🔗	joepie91	SketchCow: right, I have a JSON dump of all the articles packed up here: http://aarnist.cryto.net:81/devilskitchen.tar.gz
03:34 ^🔗	joepie91	is that sufficient?
03:34 ^🔗	joepie91	title, author, date, body
03:35 ^🔗	SketchCow	How many articles
03:35 ^🔗	joepie91	1114
03:52 ^🔗	SketchCow	OK, so.
03:52 ^🔗	SketchCow	you have a copy
03:52 ^🔗	SketchCow	you really want a warc copy as well.
03:52 ^🔗	SketchCow	You want a couple good copies, so we have something to work with in the future
03:52 ^🔗	SketchCow	WARC is what archive.org wants, although it's clunky in contemporary space for now
04:02 ^🔗	joepie91	SketchCow:
04:02 ^🔗	joepie91	cat: css.c: No such file or directory
04:02 ^🔗	joepie91	make[3]: *** [css_.c] Error 1
04:02 ^🔗	joepie91	make[3]: Leaving directory `/root/wget-warc/trunk/src'
04:02 ^🔗	joepie91	when compiling wget-warc
04:02 ^🔗	joepie91	any suggestions?
04:04 ^🔗	joepie91	debian 6 btw
04:07 ^🔗	joepie91	ah, problem solved it seems
04:07 ^🔗	joepie91	apt-get install flex && ./configure && make
04:10 ^🔗	joepie91	help ._.
04:10 ^🔗	joepie91	make[2]: *** No rule to make target `Makevars', needed by `Makefile'. Stop.
04:13 ^🔗	joepie91	right, I think it works now
04:22 ^🔗	joepie91	finally found a command that does the job
04:22 ^🔗	joepie91	lol
04:24 ^🔗	joepie91	SketchCow: okay, wget-warc'ing the blog now, let's see if I get through without google banning me
04:24 ^🔗	joepie91	it ran against a no-index, so I had to ignore it
04:24 ^🔗	joepie91	er
04:24 ^🔗	joepie91	no-follow *
05:51 ^🔗	DFJustin	<joepie91> SketchCow: going to a non-archived URL via wayback machine adds it to archive queue? <-- technically it doesn't add it to a queue, it just does a grab of the page right then
05:51 ^🔗	DFJustin	+ any prerequisites that your browser fetches
07:33 ^🔗	godane	i may do a better pull of hackaday.com
07:33 ^🔗	godane	mostly cause the images are not in warc.gz format
09:13 ^🔗	alard	joepie91: The most recent Wget release (1.14) has warc support built-in. It looks like you've compiled an older version (one with a "trunk" directory), so it might be useful to upgrade if you plan to use it again.
10:14 ^🔗	winr4r	joepie91: you're wonderful
10:14 ^🔗	winr4r	good job
13:30 ^🔗	SketchCow	Uploading a few hundred Laptop manuals
13:33 ^🔗	winr4r	good morning jason!
13:33 ^🔗	winr4r	and hello mistym
13:33 ^🔗	mistym	Morning!
13:34 ^🔗	mistym	Ugggh, why did it have to get so cold so fast? I mean it is Winnipeg, but... :/
13:35 ^🔗	winr4r	it got much colder in the last couple of days here, too
13:40 ^🔗	joepie91	SketchCow, winr4r, tar.gz with both a warc and a json dump of the blog in it: http://aarnist.cryto.net:81/devilskitchen_final.tar.gz
13:40 ^🔗	joepie91	warc seems to have completed successfully
13:41 ^🔗	joepie91	(surprisingly)
13:45 ^🔗	winr4r	joepie91: good job :)
13:53 ^🔗	SketchCow	http://archive.org/details/archiveteam-devilskitchen-panic
14:03 ^🔗	winr4r	yay!
14:08 ^🔗	joepie91	\o/
16:43 ^🔗	godane	SketchCow: just for you to know i'm getting ~40000 external images from my underground-gamer.com dump
16:43 ^🔗	godane	also i think there is enough stuff in this dump just to do a talk on pirates again
19:24 ^🔗	joepie91	would you look at that, WHOIS data in JSON format :)
19:24 ^🔗	joepie91	http://whois.cryto.net/ :D
20:46 ^🔗	DFJustin	on the subject of manual uploads, might as well toot my own horn http://archive.org/search.php?query=subject%3A%22computer%20history%22%20AND%20uploader%3A%22dopefishjustin%40gmail.com%22%20AND%20collection%3Aopensource&sort=-publicdate
20:57 ^🔗	dashcloud	looks nice
