Time | Nickname | Message
03:44 | xk_id_ | crawler is crawling...
03:44 | xk_id_ | 200 out of 2M
03:44 | xk_id_ | one worker... site seems to hold.
03:45 | xk_id_ | I'll let it run tonight, and tomorrow I'll bring in another worker.
03:52 | db48x | does anyone here use Chef or Puppet?
04:41 | chazchaz | Is weblog.nl done? I'm getting the tracker ratelimiting message.
05:31 | db48x | chazchaz: yep :)
05:42 | chazchaz | thanks
07:01 | db48x | chazchaz: you're welcome
08:55 | SmileyG | \o//
08:55 | SmileyG | now on xanga, sweet :)
09:21 | db48x | alard: nice upgrade to the tracker
10:20 | X-Scale | I wonder if Shredder Challenge has been archived yet -> http://archive.darpa.mil/shredderchallenge/
11:35 | godane | i found something interesting
11:35 | godane | https://www.youtube.com/user/filmnutlive
11:35 | godane | there may be 400+ hour-long interviews with actors
13:47 | xk_id | I got banned after only a few hours. The entire Amazon AWS IP range seems to have been banned.
13:47 | xk_id | There's not much I can do, is there?
13:48 | xk_id | (I mean, I'm trying with a different AWS node, and it's timing out too)
14:08 | ersi | xk_id: Well, other VM providers
14:08 | xk_id | hmmmm...
14:08 | ersi | You're probably easy to block anyhow though, seeing how you're doing a lot more traffic than everyone else
14:08 | xk_id | yes...
14:09 | xk_id | But I'm really scared. If I can't collect the data, my dissertation is ruined.
14:09 | ersi | Too bad, heh.
14:35 | xk_id | Basically, I just need a computer with sparse resources.
14:35 | xk_id | I mean, several.
14:35 | xk_id | What should I look for?
14:35 | xk_id | cloud computing is usually too fancy and offers much more than I need.
14:36 | xk_id | my server will stay in AWS EC2, but my workers can come from anywhere.
14:36 | SmileyG | maybe your dissertation shouldn't have been based on information you didn't have <shakes head>
14:38 | SmileyG | as for suggestions, I really don't have any, unfortunately
14:38 | SmileyG | you could try something fugly like tor, but I doubt it'd help.
14:38 | xk_id | I'm just thinking of a way to get a bunch of workers from different providers
14:38 | xk_id | basically, renting a machine on which I can run a small crawler and an ssh tunnel
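A minimal sketch of that kind of worker setup, assuming a hypothetical rented box reachable as user@rented-box (the host and port are placeholders, not from the chat):

    # open a local SOCKS proxy on port 1080 that tunnels traffic through the rented machine
    # -D: dynamic (SOCKS) port forwarding, -N: no remote command, -f: go to background
    ssh -f -N -D 1080 user@rented-box

    # a SOCKS-aware crawler, or one wrapped with a tool such as proxychains,
    # can then send its requests via localhost:1080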
14:43 | SmileyG | some small, easily deployed proxy then...
14:43 | SmileyG | btw I have no idea what course you're on, but this is similar to the issues I had with my dissertation; fixing the issues ended up more interesting than the end data itself.
14:43 | SmileyG | and we should go to #archiveteam-bs to discuss further. (You'll likely get noticed more there, too.)
14:43 | xk_id | sure
16:57 | Schbirid | what is a good tool to sniff what a browser POSTs?
16:58 | Schbirid | chrome inspector works well: POST, then scroll up in the request list
16:59 | tef | http proxy
17:07 | DFJustin | firebug on firefox
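Alongside the in-browser tools mentioned above, one crude alternative is to watch the raw traffic on the wire. A rough sketch, which only works for unencrypted HTTP and assumes capturing on all interfaces:

    # print packet contents in ASCII (-A), capture full packets (-s 0), all interfaces
    sudo tcpdump -i any -A -s 0 'tcp port 80'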
22:07 | tjvc | Hi
22:12 | mistym | Hi tjvc
22:14 | tjvc | Hey mistym. I've got a question about warc files. Can I ask here?
22:15 | chronomex | sure can
22:15 | tjvc | Awesome.
22:15 | chronomex | we might even have the right answer ;)
22:16 | tjvc | So basically I'm experimenting with a bit of web archiving with wget, and want to stick as closely as possible to archival 'best practices', so I'm saving my crawls to a warc file.
22:17 | chronomex | cool
22:17 | chronomex | you write your own crawler or use one that already exists?
22:17 | tjvc | I'm just using a wget command for now.
22:18 | tjvc | Thing is, I want to do this regularly, and take snapshots of various websites, but it seems that warc and wget timestamping aren't compatible.
22:18 | chronomex | what's wget timestamping do again?
22:19 | tjvc | It checks the last-modified times of files on the server against local copies and only downloads pages if the files have changed.
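Roughly, the behaviour being described (the URL here is a placeholder):

    # -N / --timestamping: skip files whose server Last-Modified time
    # is not newer than the local copy
    wget --timestamping --recursive --no-parent http://example.com/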
22:19 | chronomex | ah right
22:19 | chronomex | hmmm
22:19 | chronomex | you're also saving to files, right?
22:19 | chronomex | wget can't read warcs
22:20 | tjvc | Yeah, saving to files. Wget can do warc output.
22:21 | chronomex | right
22:21 | tjvc | I'm just wondering if there's a more efficient method of doing an 'update' crawl with warc output than downloading the entire site every time, which seems to be my only option with wget.
22:21 | chronomex | hmm.
22:22 | tjvc | Am I making sense?
22:22 | chronomex | yes
22:22 | chronomex | what options are you using on wget?
22:22 | chronomex | I presume --timestamping among others?
22:23 | tjvc | --recursive, --warc-file, --no-parent, --user-agent, --page-requisites, --convert-links
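Putting those options together, a minimal sketch of such an invocation (the URL, user-agent string, and WARC filename are placeholders, not from the chat):

    wget --recursive --no-parent --page-requisites --convert-links \
         --user-agent="example-archival-crawl" \
         --warc-file="example-site-2013-02" \
         http://example.com/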
22:23 | chronomex | hmm ok
22:23 | tjvc | If I try and use --timestamping with --warc-file I get an error:
22:23 | tjvc | "WARC output does not work with timestamping, timestamping will be disabled."
22:23 | chronomex | oh really, I didn't know that
22:23 | chronomex | okay
22:23 | chronomex | hmm
22:24 | tjvc | It does kind of make sense.
22:25 | chronomex | you could do some custom header stuff, like adding an If-Modified-Since with the timestamp of the file you have on hand
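A minimal sketch of that idea (the date and URL are placeholders); pages the server reports as unchanged come back as 304 Not Modified, so their bodies are not downloaded again:

    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-update" \
         --header="If-Modified-Since: Fri, 01 Feb 2013 00:00:00 GMT" \
         http://example.com/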
22:25 | tjvc | Hmm.
22:27 | chronomex | that's all I've got
22:27 | chronomex | wg 12
22:27 | chronomex | misfire
22:27 | tjvc | Just looking into that now: didn't realize you could do that with wget.
22:29 | tjvc | It's a good idea. I really need to familiarize myself with warc files a bit more. It may be that having lots of partial crawls stored in warc files is actually a bit pointless.
22:29 | chronomex | maybe.
22:31 | tjvc | Thanks a lot for the advice chronomex.
22:31 | chronomex | my pleasure, any time
22:32 | alard | tjvc: Wget does have an option to remove duplicate records from the warc.
22:33 | tjvc | --warc-cdx and --warc-dedup?
22:33 | alard | Yes. It will still download the files, but it won't store them in the second warc file.
22:33 | alard | It checks the MD5 or SHA1 hash of the download, and if it matches it adds a reference to the record in the earlier warc.
22:35 | tjvc | Ok. That half solves the problem.
22:35 | alard | (A 'revisit' record, section 6.7.2 on page 15 of http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf )
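A minimal sketch of how those two options fit together across two crawls (the filenames and URL are placeholders): the first crawl writes a CDX index next to its WARC, and the second crawl consults that index so unchanged responses are stored only as 'revisit' records.

    # first (full) crawl: writes example-site-01.warc.gz plus example-site-01.cdx
    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-01" --warc-cdx \
         http://example.com/

    # later crawl: still re-downloads everything, but deduplicates against the earlier CDX
    wget --recursive --no-parent --page-requisites \
         --warc-file="example-site-02" --warc-cdx \
         --warc-dedup="example-site-01.cdx" \
         http://example.com/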
22:36 | tjvc | If you were taking regular snapshots of a site, would you do this, or would you take a complete copy each time?
22:36 | tjvc | Would be interesting to know how others would go about it.
22:37 | chronomex | depends on how your storage is structured
22:37 | chronomex | you don't want to have one copy of a thing and then have it disappear
22:37 | chronomex | but you don't need ten million copies
22:39 | alard | You could do a complete copy regularly, with updates in between.
22:40 | tjvc | I'm thinking about monthly crawls. The site has ~10,000 pages.
22:40 | alard | (I'm doing something like that here, though that has more of a technical reason: https://archive.org/details/dutch-news-homepages-2013-02)
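One way to put the "complete copy regularly, with updates in between" idea on a schedule is a crontab along these lines (the times, paths, and script names are hypothetical; each script would wrap a wget invocation like the ones sketched above):

    # full crawl at 03:00 on the 1st of each month
    0 3 1 * * /home/archive/bin/full-crawl.sh
    # deduplicated update crawl at 03:00 every Monday
    0 3 * * 1 /home/archive/bin/update-crawl.sh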
22:40 | chronomex | tjvc: what change frequency do they have?
22:41 | tjvc | Most of it is archived news or blog posts etc., so the bulk of the content is probably static.
22:42 | db48x | let your storage handle redundancy
22:42 | db48x | raid + backups
22:42 | tjvc | db48x: sure, that's what we'll be doing.
23:09 | tjvc | chronomex: Using the If-Modified-Since header works well, although if the index page has not been modified wget stops there, which could be problematic. Thanks.
23:09 | chronomex | cool, good to hear :)