Time | Nickname | Message
03:44 | xk_id_ | crawler is crawling...
03:44 | xk_id_ | 200 out of 2M
03:44 | xk_id_ | one worker... site seems to hold.
03:45 | xk_id_ | I'll let it run tonight, and tomorrow I'll bring in another worker.
03:52 | db48x | does anyone here use Chef or Puppet?
04:41 | chazchaz | Is weblog.nl done? I'm getting the tracker ratelimiting message.
05:31 | db48x | chazchaz: yep :)
05:42 | chazchaz | thanks
07:01 | db48x | chazchaz: you're welcome
08:55 | SmileyG | \o//
08:55 | SmileyG | now on xanga, sweet :)
09:21 | db48x | alard: nice upgrade to the tracker
10:20 | X-Scale | I wonder if Shredder Challenge has been archived yet -> http://archive.darpa.mil/shredderchallenge/
11:35 | godane | i found something interesting
11:35 | godane | https://www.youtube.com/user/filmnutlive
11:35 | godane | there may be 400+ hour-long interviews with actors
13:47 | xk_id | I got banned after only a few hours. The entire Amazon AWS IP range seems to have been banned.
13:47 | xk_id | There's not much I can do, is there?
13:48 | xk_id | (I mean, I'm trying with a different AWS node, and it's timing out too)
14:08 | ersi | xk_id: Well, other VM providers
14:08 | xk_id | hmmmm...
14:08 | ersi | You're probably easy to block anyhow though, seeing how you're doing a lot more traffic than everyone else
14:08 | xk_id | yes...
14:09 | xk_id | But I'm really scared. If I can't collect the data, my dissertation is ruined.
14:09 | ersi | Too bad, heh.
14:35 | xk_id | Basically, I just need a computer with sparse resources.
14:35 | xk_id | I mean, several.
14:35 | xk_id | What should I look for?
14:35 | xk_id | cloud computing is usually too fancy and offers much more than I need.
14:36 | xk_id | my server will stay in AWS EC2, but my workers can come from anywhere.
14:36 | SmileyG | maybe your dissertation shouldn't have been based on information you didn't have <shakes head>
14:38 | SmileyG | as for suggestions, I really don't have any, unfortunately
14:38 | SmileyG | you could try something fugly like tor, but I doubt it'd help.
14:38 | xk_id | I'm just thinking of a way to get a bunch of workers from different providers
14:38 | xk_id | basically, renting a machine on which I can run a small crawler and an ssh tunnel
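A minimal sketch of that kind of worker setup, assuming a hypothetical rented box reachable as user@rented-box (the host and port are placeholders, not from the chat):

    # open a local SOCKS proxy on port 1080 that tunnels traffic through the rented machine
    # -D: dynamic (SOCKS) port forwarding, -N: no remote command, -f: go to background
    ssh -f -N -D 1080 user@rented-box

    # a SOCKS-aware crawler, or one wrapped with a tool such as proxychains,
    # can then send its requests via localhost:1080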
14:43 | SmileyG | some small, easily deployed proxy then...
14:43 | SmileyG | btw I have no idea what course you're on, but this is similar to the issues I had with my dissertation; fixing the issues ended up more interesting than the end data itself.
14:43 | SmileyG | and we should go to #archiveteam-bs to discuss further. (You'll likely get noticed more there, too.)
14:43 | xk_id | sure
16:57 | Schbirid | what is a good tool to sniff what a browser POSTs?
16:58 | Schbirid | chrome inspector works well: POST, then scroll up in the request list
16:59 | tef | http proxy
17:07 | DFJustin | firebug on firefox
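Alongside the in-browser tools mentioned above, one crude alternative is to watch the raw traffic on the wire. A rough sketch, which only works for unencrypted HTTP and assumes capturing on all interfaces:

    # print packet contents in ASCII (-A), capture full packets (-s 0), all interfaces
    sudo tcpdump -i any -A -s 0 'tcp port 80'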
22:07 | tjvc | Hi
22:12 | mistym | Hi tjvc
22:14 | tjvc | Hey mistym. I've got a question about warc files. Can I ask here?
22:15 | chronomex | sure can
22:15 | tjvc | Awesome.
22:15 | chronomex | we might even have the right answer ;)
22:16 | tjvc | So basically I'm experimenting with a bit of web archiving with wget, and want to stick as closely as possible to archival 'best practices', so I'm saving my crawls to a warc file.
22:17 | chronomex | cool
22:17 | chronomex | you write your own crawler or use one that already exists?
22:17 | tjvc | I'm just using a wget command for now.
22:18 | tjvc | Thing is, I want to do this regularly, and take snapshots of various websites, but it seems that warc and wget timestamping aren't compatible.
22:18 | chronomex | what's wget timestamping do again?
22:19 | tjvc | It checks the last-modified times of files on the server against local copies and only downloads pages if the files have changed.
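Roughly, the behaviour being described (the URL here is a placeholder):

    # -N / --timestamping: skip files whose server Last-Modified time
    # is not newer than the local copy
    wget --timestamping --recursive --no-parent http://example.com/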
22:19 | chronomex | ah right
22:19 | chronomex | hmmm
22:19 | chronomex | you're also saving to files, right?
22:19 | chronomex | wget can't read warcs
22:20 | tjvc | Yeah, saving to files. Wget can do warc output.
22:21 | chronomex | right
22:21 | tjvc | I'm just wondering if there's a more efficient method of doing an 'update' crawl with warc output than downloading the entire site every time, which seems to be my only option with wget.
22:21 | chronomex | hmm.
22:22 | tjvc | Am I making sense?
22:22 | chronomex | yes
22:22 | chronomex | what options are you using on wget?
22:22 | chronomex | I presume --timestamping among others?
22:23 | tjvc | --recursive, --warc-file, --no-parent, --user-agent, --page-requisites, --convert-links
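Putting those options together, a minimal sketch of such an invocation (the URL, user-agent string, and WARC filename are placeholders, not from the chat):

    wget --recursive --no-parent --page-requisites --convert-links \
         --user-agent="example-archival-crawl" \
         --warc-file="example-site-2013-02" \
         http://example.com/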
22:23 | chronomex | hmm ok
22:23 | tjvc | If I try and use --timestamping with --warc-file I get an error:
22:23 | tjvc | "WARC output does not work with timestamping, timestamping will be disabled."
22:23 | chronomex | oh really, I didn't know that
22:23 | chronomex | okay
22:23 | chronomex | hmm
22:24 | tjvc | It does kind of make sense.
22:25 | chronomex | you could do some custom header stuff, like adding an If-Modified-Since with the timestamp of the file you have on hand
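A minimal sketch of that idea (the date and URL are placeholders); pages the server reports as unchanged come back as 304 Not Modified, so their bodies are not downloaded again:

    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-update" \
         --header="If-Modified-Since: Fri, 01 Feb 2013 00:00:00 GMT" \
         http://example.com/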
22:25 | tjvc | Hmm.
22:27 | chronomex | that's all I've got
22:27 | chronomex | wg 12
22:27 | chronomex | misfire
22:27 | tjvc | Just looking into that now: didn't realize you could do that with wget.
22:29 | tjvc | It's a good idea. I really need to familiarize myself with warc files a bit more. It may be that having lots of partial crawls stored in warc files is actually a bit pointless.
22:29 | chronomex | maybe.
22:31 | tjvc | Thanks a lot for the advice chronomex.
22:31 | chronomex | my pleasure, any time
22:32 | alard | tjvc: Wget does have an option to remove duplicate records from the warc.
22:33 | tjvc | --warc-cdx and --warc-dedup?
22:33 | alard | Yes. It will still download the files, but it won't store them in the second warc file.
22:33 | alard | It checks the MD5 or SHA1 hash of the download, and if it matches it adds a reference to the record in the earlier warc.
22:35 | tjvc | Ok. That half solves the problem.
22:35 | alard | (A 'revisit' record, section 6.7.2 on page 15 of http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf )
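A minimal sketch of how those two options fit together across two crawls (the filenames and URL are placeholders): the first crawl writes a CDX index next to its WARC, and the second crawl consults that index so unchanged responses are stored only as 'revisit' records.

    # first (full) crawl: writes example-site-01.warc.gz plus example-site-01.cdx
    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-01" --warc-cdx \
         http://example.com/

    # later crawl: still re-downloads everything, but deduplicates against the earlier CDX
    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-02" --warc-cdx \
         --warc-dedup="example-site-01.cdx" \
         http://example.com/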
22:36 | tjvc | If you were taking regular snapshots of a site, would you do this, or would you take a complete copy each time?
22:36 | tjvc | Would be interesting to know how others would go about it.
22:37 | chronomex | depends on how your storage is structured
22:37 | chronomex | you don't want to have one copy of a thing and then have it disappear
22:37 | chronomex | but you don't need ten million copies
22:39 | alard | You could do a complete copy regularly, with updates in between.
22:40 | tjvc | I'm thinking about monthly crawls. The site has ~10,000 pages.
22:40 | alard | (I'm doing something like that here, though that has more of a technical reason: https://archive.org/details/dutch-news-homepages-2013-02)
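One way to put the "complete copy regularly, with updates in between" idea on a schedule is a crontab along these lines (the times, paths, and script names are hypothetical; each script would wrap a wget invocation like the ones sketched above):

    # full crawl at 03:00 on the 1st of each month
    0 3 1 * * /home/archive/bin/full-crawl.sh
    # deduplicated update crawl at 03:00 every Monday
    0 3 * * 1 /home/archive/bin/update-crawl.sh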
22:40 | chronomex | tjvc: what change frequency do they have?
22:41 | tjvc | Most of it is archived news or blog posts etc., so the bulk of the content is probably static.
22:42 | db48x | let your storage handle redundancy
22:42 | db48x | raid + backups
22:42 | tjvc | db48x: sure, that's what we'll be doing.
23:09 | tjvc | chronomex: Using the If-Modified-Since header works well, although if the index page has not been modified wget stops there, which could be problematic. Thanks.
23:09 | chronomex | cool, good to hear :)