[03:44] crawler is crawling...
[03:44] 200 out of 2M
[03:44] one worker... site seems to hold.
[03:45] I'll let it run tonight, and tomorrow I'll bring in another worker.
[03:52] does anyone here use Chef or Puppet?
[04:41] Is weblog.nl done? I'm getting the tracker ratelimiting message.
[05:31] chazchaz: yep :)
[05:42] thanks
[07:01] chazchaz: you're welcome
[08:55] \o//
[08:55] now on xanga, sweet :)
[09:21] alard: nice upgrade to the tracker
[10:20] I wonder if Shredder Challenge has been archived yet -> http://archive.darpa.mil/shredderchallenge/
[11:35] i found something interesting
[11:35] https://www.youtube.com/user/filmnutlive
[11:35] there may be 400+ hour-long interviews with actors
[13:47] I got banned after only a few hours. The entire Amazon AWS IP range seems to have been banned.
[13:47] There's not much I can do, is there?
[13:48] (I mean, I'm trying with a different AWS node, and it's timing out too)
[14:08] xk_id: Well, other VM providers
[14:08] hmmmm...
[14:08] You're probably easy to block anyhow, though, seeing how you're doing a lot more traffic than everyone else
[14:08] yes...
[14:09] But I'm really scared. If I can't collect the data, my dissertation is ruined.
[14:09] Too bad, heh.
[14:35] Basically, I just need a computer with sparse resources.
[14:35] I mean, several.
[14:35] What should I look for?
[14:35] cloud computing is usually too fancy and offers much more than what I need.
[14:36] my server will stay in AWS EC2, but my workers can come from anywhere.
[14:36] maybe your dissertation shouldn't have been based on information you didn't have
[14:38] as for suggestions, I really don't have any, unfortunately
[14:38] you could try something fugly like Tor, but I doubt it'd help.
[14:38] I'm just thinking of a way to get a bunch of workers from different providers
[14:38] basically, renting a machine on which I can run a small crawler and an ssh tunnel
[14:43] some small, easily deployed proxy then...
[14:43] btw I have no idea what course you're on, but this is similar to the issues I had with my dissertation; fixing the issues ended up more interesting than the end data itself.
[14:43] and we should go to #archiveteam-bs to discuss further. (You'll likely get noticed more there too.)
[14:43] sure
[16:57] what is a good tool to sniff what a browser POSTs?
[16:58] chrome inspector works well: POST, then scroll up in the request list
[16:59] http proxy
[17:07] firebug on firefox
[22:07] Hi
[22:12] Hi tjvc
[22:14] Hey mistym. I've got a question about warc files. Can I ask here?
[22:15] sure can
[22:15] Awesome.
[22:15] we might even have the right answer ;)
[22:16] So basically I'm experimenting with a bit of web archiving with wget, and I want to stick as closely as possible to archival 'best practices', so I'm saving my crawls to a warc file.
[22:17] cool
[22:17] you write your own crawler or use one that already exists?
[22:17] I'm just using a wget command for now.
[22:18] Thing is, I want to do this regularly and take snapshots of various websites, but it seems that warc and wget timestamping aren't compatible.
[22:18] what's wget timestamping do again?
[22:19] It checks last-modified times of files on the server against local copies and only downloads pages if the files have changed.
[22:19] ah right
[22:19] hmmm
[22:19] you're also saving to files, right?
[22:19] wget can't read warcs
[22:20] Yeah, saving to files. Wget can do warc output.
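(For reference, a crawl of the kind described here would look roughly like the sketch below, built from the wget flags listed later in this conversation; the URL, WARC filename, and user-agent string are placeholders, not values from the log.)

    wget --recursive --no-parent --page-requisites --convert-links \
         --user-agent="my-archiver/0.1" \
         --warc-file="example-site-2013-02" \
         http://example.com/

wget writes the crawled responses into example-site-2013-02.warc.gz while still saving the page files to disk as usual.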
[22:21] right
[22:21] I'm just wondering if there's a more efficient method of doing an 'update' crawl with warc output than downloading the entire site every time, which seems to be my only option with wget.
[22:21] hmm.
[22:21] Am I making sense?
[22:22] yes
[22:22] what options are you using on wget?
[22:22] I presume --timestamping among others?
[22:23] --recursive, --warc-file, --no-parent, --user-agent, --page-requisites, --convert-links
[22:23] hmm ok
[22:23] If I try to use --timestamping with --warc-file I get an error:
[22:23] "WARC output does not work with timestamping, timestamping will be disabled."
[22:23] oh really, I didn't know that
[22:23] okay
[22:23] hmm
[22:23] It does kind of make sense.
[22:24] you could do some custom header stuff, like adding an If-Modified-Since with the timestamp of the file you have on hand
[22:25] Hmm.
[22:27] that's all I've got
[22:27] wg 12
[22:27] misfire
[22:27] Just looking into that now: didn't realize you could do that with wget.
[22:29] It's a good idea. I really need to familiarize myself with warc files a bit more. It may be that having lots of partial crawls stored in warc files is actually a bit pointless.
[22:29] maybe.
[22:31] Thanks a lot for the advice, chronomex.
[22:31] my pleasure, any time
[22:32] tjvc: Wget does have an option to remove duplicate records from the warc.
[22:33] --warc-cdx and --warc-dedup ?
[22:33] Yes. It will still download the files, but it won't store them in the second warc file.
[22:33] It checks the MD5 or SHA-1 hash of the download, and if it matches, it adds a reference to the record in the earlier warc.
[22:35] Ok. That half solves the problem.
[22:35] (A 'revisit' record, section 6.7.2 on page 15 of http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf )
[22:36] If you were taking regular snapshots of a site, would you do this, or would you take a complete copy each time?
[22:36] Would be interesting to know how others would go about it.
[22:37] depends on how your storage is structured
[22:37] you don't want to have one copy of a thing and then have it disappear
[22:37] but you don't need ten million copies
[22:39] You could do a complete copy regularly, with updates in between.
[22:40] I'm thinking about monthly crawls. The site has ~10,000 pages.
[22:40] (I'm doing something like that here, though that has more of a technical reason: https://archive.org/details/dutch-news-homepages-2013-02)
[22:40] tjvc: what change frequency do they have?
[22:41] Most of it is archived news or blog posts etc., so the bulk of the content is probably static.
[22:42] let your storage handle redundancy
[22:42] raid + backups
[22:42] db48x: sure, that's what we'll be doing.
[23:09] chronomex: Using the If-Modified-Since header works well, although if the index page has not been modified wget stops there, which could be problematic. Thanks.
[23:09] cool, good to hear :)
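(A rough sketch of the two approaches discussed above, the CDX-based dedup and the If-Modified-Since workaround; URLs, filenames, and the example date are placeholders.)

    # first snapshot: also write a CDX index of the records in the WARC
    wget --recursive --no-parent --page-requisites \
         --warc-file="site-2013-02" --warc-cdx \
         http://example.com/

    # later snapshot: deduplicate against the earlier CDX; unchanged payloads
    # are stored as 'revisit' records pointing at the first WARC
    wget --recursive --no-parent --page-requisites \
         --warc-file="site-2013-03" --warc-dedup="site-2013-02.cdx" \
         http://example.com/

    # alternative: send a conditional request so unchanged pages come back as 304
    wget --recursive --no-parent --page-requisites \
         --warc-file="site-2013-03" \
         --header="If-Modified-Since: Sat, 02 Feb 2013 00:00:00 GMT" \
         http://example.com/

As noted at the end of the log, the conditional-request approach has a catch: if the index page itself returns 304, wget has nothing to recurse into, so the dedup route is the safer one for full-site snapshots.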