#archiveteam 2013-10-12,Sat


Time Nickname Message
00:03 🔗 godane Sean Carruthers: http://www.flickr.com/photos/globalhermit/
01:14 🔗 balrog http://tag3ulp55xczs3pn.onion needs to be archived
01:14 🔗 balrog someone spammed the forum with a ton of crap ;/
01:15 🔗 balrog but the data *is* still there
04:35 🔗 kyan balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
16:06 🔗 kyan Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
16:07 🔗 omf_ kyan, nope
16:08 🔗 kyan omf_: is there any solution to that problem, other than manually grabbing every URL?
16:08 🔗 omf_ Not that I know of
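
Background for the error kyan is hitting: most Linux filesystems cap a single filename component at 255 bytes (NAME_MAX), so when wget derives an on-disk filename from a URL with a very long path or query string, the save fails even though the fetch itself would succeed. A minimal illustration in bash, with the 300-character name purely hypothetical:

    # exceeds the 255-byte NAME_MAX enforced by ext4 and most other
    # Linux filesystems, reproducing the error kyan describes
    touch "$(printf 'a%.0s' {1..300})"
    # touch: cannot touch 'aaa...': File name too long
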
16:11 🔗 balrog omf_: afaik wget+warc will write the right thing into the warc file
16:11 🔗 chfoo i thought the wget-lua version we use saves it directly, with just a temp file used along the way
16:11 🔗 chfoo using the --output-document and --truncate-output options
16:12 🔗 omf_ Try it out and see if it works
16:14 🔗 kyan hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
16:27 🔗 kyan Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
16:44 🔗 chfoo i think wget-lua is patched to fix that problem; the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
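
A minimal sketch of the invocation under discussion, assuming wget-lua built from the "lua" branch and a hypothetical target URL; --warc-file and --output-document are standard wget options, while --truncate-output is the wget-lua addition chfoo describes:

    # every response body is written to the same scratch file, which
    # --truncate-output resets per request, so no on-disk filename is
    # ever derived from a (possibly too long) URL; the real payload
    # lives in example.warc.gz
    wget-lua --recursive --page-requisites \
             --warc-file=example \
             --output-document=scratch.tmp --truncate-output \
             "http://example.com/"

With plain wget, --output-document instead appends every response to one growing file on disk; the WARC itself is still written correctly, which is what chfoo confirms below.
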
16:47 🔗 chfoo actually.. i just tried out your command with normal wget and my warc file contains content
16:49 🔗 chfoo how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped after opening the first one
17:13 🔗 bsmith094 7zip ultra compression rules, 10-12x
17:20 🔗 kyan chfoo: I did another test. Here are: the command I ran (a script), the output of the command to the terminal, and all files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
17:25 🔗 ersi I'd say, assume as little as possible about wget-lua
17:25 🔗 ersi It's not almighty
17:25 🔗 chfoo kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
17:26 🔗 kyan chfoo: Huh. I'll try extracting it another way… weird
17:27 🔗 kyan Now it's working.
17:27 🔗 * kyan is thoroughly confused now
17:27 🔗 kyan I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
17:30 🔗 chfoo "warc.gz" is a misnomer
18:14 🔗 yipdw kyan: yes
18:14 🔗 yipdw kyan: oh wait, chfoo already answered that
18:16 🔗 yipdw kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
18:17 🔗 yipdw the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
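
A quick way to see the multi-member layout yipdw describes, assuming a file named example.warc.gz; the gzip format permits concatenated members and gunzip decompresses them all in sequence, while single-member extractors (like the graphical utilities above) stop after the first record:

    # decompress every gzip member in sequence; each WARC record is its
    # own member, so this recovers the complete uncompressed WARC
    gunzip -c example.warc.gz | grep -c '^WARC/1.0'   # count record headers
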
18:18 🔗 kyan yipdw: I see, thanks. That makes sense.
18:23 🔗 kyan Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
18:29 🔗 chfoo kyan: it's on the "lua" branch if you haven't seen it yet
18:31 🔗 chfoo alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
18:31 🔗 kyan chfoo: oh thanks sorry I was looking at master
18:32 🔗 chfoo i'll fix the wiki and default branch since i made the same mistake a few times
18:33 🔗 kyan chfoo: Disk space hasn't been an issue… just the long file name issue.
18:36 🔗 chfoo kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
18:38 🔗 kyan chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
20:05 🔗 Sellyme Oh hey, my upload speed just tripled. Sweet.
20:36 🔗 robv so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
20:39 🔗 ersi_ We got a project up and running for blip.tv. Feel free to contribute
20:44 🔗 robv got two warrior instances up and running
20:50 🔗 * closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
20:58 🔗 godane you have a git version of healthcare.gov?
20:58 🔗 yipdw wait, that was on github at one point?
21:39 🔗 omf_ closure, put it up on IA, I have people asking me for it
22:04 🔗 andy0 is archivebot a thing to run at this time?
22:10 🔗 yipdw it's online, yes
22:43 🔗 closure omf_: uploading now
22:51 🔗 closure http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
23:40 🔗 closure anyone seen this before? https://conservatory.github.io/
23:41 🔗 closure (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
23:43 🔗 * closure kicks himself for not having run github-backup in that repo
23:50 🔗 balrog which repo?
