#archiveteam 2013-02-04,Mon


Time Nickname Message
03:44 🔗 xk_id_ crawler is crawling...
03:44 🔗 xk_id_ 200 out of 2M
03:44 🔗 xk_id_ one worker... site seems to hold.
03:45 🔗 xk_id_ I'll let it run tonight, and tomorrow I'll bring in another worker.
03:52 🔗 db48x does anyone here use Chef or Puppet?
04:41 🔗 chazchaz Is weblog.nl done? I'm getting the tracker ratelimiting message.
05:31 🔗 db48x chazchaz: yep :)
05:42 🔗 chazchaz thanks
07:01 🔗 db48x chazchaz: you're welcome
08:55 🔗 SmileyG \o//
08:55 🔗 SmileyG now on xanga, sweet :)
09:21 🔗 db48x alard: nice upgrade to the tracker
10:20 🔗 X-Scale I wonder if Shredder Challenge has been archived yet -> http://archive.darpa.mil/shredderchallenge/
11:35 🔗 godane i found something interesting
11:35 🔗 godane https://www.youtube.com/user/filmnutlive
11:35 🔗 godane there may be 400+ hour-long interviews with actors
13:47 🔗 xk_id I got banned after only a few hours. The entire Amazon AWS IP range seems to have been banned.
13:47 🔗 xk_id There's not much I can do, is there?
13:48 🔗 xk_id (I mean, I'm trying with a different AWS node, and it's timing out too)
14:08 🔗 ersi xk_id: Well, other VM providers
14:08 🔗 xk_id hmmmm...
14:08 🔗 ersi You're probably easy to block anyhow though, seeing how you're doing a lot more traffic than everyone else
14:08 🔗 xk_id yes...
14:09 🔗 xk_id But I'm really scared. If I can't collect the data, my dissertation is ruined.
14:09 🔗 ersi Too bad, heh.
14:35 🔗 xk_id Basically, I just need a computer with modest resources.
14:35 🔗 xk_id I mean, several.
14:35 🔗 xk_id What should I look for?
14:35 🔗 xk_id cloud computing is usually too fancy and offers much more than I need.
14:36 🔗 xk_id my server will stay in AWS EC2, but my workers can come from anywhere.
14:36 🔗 SmileyG maybe your dissertation shouldn't have been based on information you didn't have <shakes head>
14:38 🔗 SmileyG as for suggestions, I really don't have any unfortunately
14:38 🔗 SmileyG you could try something fugly like tor, but I doubt it'd help.
14:38 🔗 xk_id I'm just thinking of a way to get a bunch of workers from different providers
14:38 🔗 xk_id basically, renting a machine on which I can run a small crawler and a ssh tunnel
14:43 🔗 SmileyG some small easily deployed proxy then...
14:43 🔗 SmileyG btw I have no idea what course you're on, but this is similar to the issues I had with my dissertation; fixing the issues ended up more interesting than the end data itself.
14:43 🔗 SmileyG and we should go to #archiveteam-bs to discuss further. (You'll likely get noticed more there too.)
14:43 🔗 xk_id sure
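[Editor's note: a minimal sketch of the "small crawler plus ssh tunnel" setup xk_id describes above, assuming a rented box reachable over SSH; the hostname, port and the curl-based fetch are placeholders, not taken from the log.]

    # open a local SOCKS proxy on port 1080, tunnelled through the rented worker
    ssh -f -N -D 1080 user@rented-worker.example.com

    # route the crawler's requests through that tunnel (curl shown as a stand-in fetcher)
    curl --socks5-hostname localhost:1080 http://target-site.example/page

Each additional worker would get its own rented box and its own local proxy port.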
16:57 🔗 Schbirid what is a good tool to sniff what a browser POSTs?
16:58 🔗 Schbirid chrome inspector works well: POST, then scroll up in the request list
16:59 🔗 tef http proxy
17:07 🔗 DFJustin firebug on firefox
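[Editor's note: as an illustration of the "http proxy" route tef mentions, mitmproxy is one option (it is not named in the log); it listens on port 8080 by default.]

    # start the intercepting proxy, then point the browser's HTTP proxy at localhost:8080
    mitmproxy

    # submit the form in the browser; the POST request and its body appear in the mitmproxy UI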
22:07 🔗 tjvc Hi
22:12 🔗 mistym Hi tjvc
22:14 🔗 tjvc Hey mistym. I've got a question about warc files. Can I ask here?
22:15 🔗 chronomex sure can
22:15 🔗 tjvc Awesome.
22:15 🔗 chronomex we might even have the right answer ;)
22:16 🔗 tjvc So basically I'm experimenting with a bit of web archiving with wget, and want to stick as closely as possible to archival 'best practices', so I'm saving my crawls to a warc file.
22:17 🔗 chronomex cool
22:17 🔗 chronomex you write your own crawler or use one that already exists?
22:17 🔗 tjvc I'm just using a wget command for now.
22:18 🔗 tjvc Thing is, I want to do this regularly, and take snapshots of various websites, but it seems that warc and wget timestamping aren't compatible.
22:18 🔗 chronomex what's wget timestamping do again?
22:19 🔗 tjvc It checks the last-modified times of files on the server against local copies and only downloads pages if the files have changed.
22:19 🔗 chronomex ah right
22:19 🔗 chronomex hmmm
22:19 🔗 chronomex you're also saving to files, right?
22:19 🔗 chronomex wget can't read warcs
22:20 🔗 tjvc Yeah, saving to files. Wget can do warc output.
22:21 🔗 chronomex right
22:21 🔗 tjvc I'm just wondering if there's a more efficient method of doing an 'update' crawl with warc output than downloading the entire site every time, which seems to be my only option with wget.
22:21 🔗 chronomex hmm.
22:21 🔗 tjvc Am I making sense?
22:22 🔗 chronomex yes
22:22 🔗 chronomex what options are you using on wget?
22:22 🔗 chronomex I presume --timestamping among others?
22:23 🔗 tjvc --recursive, --warc-file, --no-parent, --user-agent, --page-requisites, --convert-links
22:23 🔗 chronomex hmm ok
22:23 🔗 tjvc If I try and use --timestamping with --warc-file I get an error:
22:23 🔗 tjvc "WARC output does not work with timestamping, timestamping will be disabled."
22:23 🔗 chronomex oh really, I didn't know that
22:23 🔗 chronomex okay
22:23 🔗 chronomex hmm
22:23 🔗 tjvc It does kind of make sense.
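[Editor's note: for reference, a crawl along the lines tjvc lists above might look like this; example.com, the user-agent string and the warc name are placeholders. Adding --timestamping to it is what triggers the warning quoted above.]

    wget --recursive --no-parent --page-requisites --convert-links \
         --user-agent="my-archiver/0.1" \
         --warc-file=site-2013-02-04 \
         http://example.com/

    # with --timestamping added, wget prints:
    #   "WARC output does not work with timestamping, timestamping will be disabled."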
22:24 🔗 chronomex you could do some custom header stuff, like adding an If-Modified-Since with the timestamp of the file you have on hand
22:25 🔗 tjvc Hmm.
22:27 🔗 chronomex that's all I've got
22:27 🔗 chronomex wg 12
22:27 🔗 chronomex misfire
22:27 🔗 tjvc Just looking into that now: didn't realize you could do that with wget.
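[Editor's note: a rough sketch of chronomex's header suggestion, assuming GNU date and a previously saved copy of the page; the paths and site are placeholders.]

    # build an HTTP date from the local copy's mtime and send it as If-Modified-Since
    wget --recursive --no-parent --page-requisites \
         --warc-file=site-update \
         --header="If-Modified-Since: $(date -u -r old/index.html +'%a, %d %b %Y %H:%M:%S GMT')" \
         http://example.com/

The same header goes out with every request in the recursive crawl, which is blunt: as noted later in the log, wget stops early if the index page itself is unmodified.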
22:29 🔗 tjvc It's a good idea. I really need to familiarize myself with warc files a bit more. It may be that having lots of partial crawls stored in warc files is actually a bit pointless.
22:29 🔗 chronomex maybe.
22:31 🔗 tjvc Thanks a lot for the advice chronomex.
22:31 🔗 chronomex my pleasure, any time
22:32 🔗 alard tjvc: Wget does have an option to remove duplicate records from the warc.
22:33 🔗 tjvc --warc-cdx and --warc-dedup?
22:33 🔗 alard Yes. It will still download the files, but it won't store them in the second warc file.
22:33 🔗 alard It checks the MD5 or SHA1 hash of the download, and if it matches it adds a reference to the record in the earlier warc.
22:35 🔗 tjvc Ok. That half solves the problem
22:35 🔗 alard (A 'revisit' record, section 6.7.2 on page 15 of http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf )
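[Editor's note: a sketch of the dedup workflow alard describes, with placeholder names: the first crawl writes a CDX index alongside its warc, and the second crawl consults that index so unchanged responses are stored as revisit records instead of full copies.]

    # first crawl: write crawl-01.warc.gz plus a CDX index (named after the warc prefix)
    wget --recursive --no-parent --page-requisites \
         --warc-file=crawl-01 --warc-cdx \
         http://example.com/

    # second crawl: still downloads everything, but responses already listed in the CDX
    # are written as revisit records referencing the earlier warc
    wget --recursive --no-parent --page-requisites \
         --warc-file=crawl-02 --warc-cdx \
         --warc-dedup=crawl-01.cdx \
         http://example.com/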
22:36 🔗 tjvc If you were taking regular snapshots of a site, would you do this, or would you take a complete copy each time?
22:36 🔗 tjvc Would be interesting to know how others would go about it.
22:37 🔗 chronomex depends on how your storage is structured
22:37 🔗 chronomex you don't want to have one copy of a thing and then have it disappear
22:37 🔗 chronomex but you don't need ten million copies
22:39 🔗 alard You could do a complete copy regularly, with updates in between.
22:40 🔗 tjvc I'm thinking about monthly crawls. The site has ~10,000 pages.
22:40 🔗 alard (I'm doing something like that here, though that has more of a technical reason: https://archive.org/details/dutch-news-homepages-2013-02)
22:40 🔗 chronomex tjvc: what change frequency do they have?
22:41 🔗 tjvc Most of it is archived news or blog posts etc., so the bulk of the content is probably static.
22:42 🔗 db48x let your storage handle redundancy
22:42 🔗 db48x raid + backups
22:42 🔗 tjvc db48x: sure, that's what we'll be doing.
23:09 🔗 tjvc chronomex: Using the If-Modified-Since header works well, although if the index page has not been modified wget stops there, which could be problematic. Thanks.
23:09 🔗 chronomex cool, good to hear :)
