Time | Nickname | Message
00:03 | godane | Sean Carruthers: http://www.flickr.com/photos/globalhermit/
01:14 | balrog | http://tag3ulp55xczs3pn.onion needs to be archived
01:14 | balrog | someone spammed the forum with a ton of crap ;/
01:15 | balrog | but the data *is* still there
04:35 | kyan | balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
16:06 | kyan | Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
16:07 | omf_ | kyan, nope
16:08 | kyan | omf_: is there any solution to that problem, other than manually grabbing every URL?
16:08 | omf_ | Not that I know of
16:11 | balrog | omf_: afaik wget+warc will write the right thing into the warc file
16:11 | chfoo | i thought the wget lua version we use saves it directly, with just a temp file used in between
16:11 | chfoo | using the --output-document and --truncate-output options
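A minimal sketch of the invocation being described here, assuming a wget-lua build installed as wget-lua and an illustrative target URL; only --warc-file, --output-document, and --truncate-output come from the discussion, everything else is an assumption:

    # Write every response into example-site.warc.gz while sending all
    # fetched bodies to a single fixed-name scratch file, so long remote
    # paths never have to become local filenames.
    # --truncate-output is wget-lua-specific: it overwrites the scratch
    # file for each URL instead of appending to it.
    wget-lua --mirror \
      --warc-file=example-site \
      --output-document=scratch.tmp \
      --truncate-output \
      http://example.com/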
16:12 | omf_ | Try it out and see if it works
16:14 | kyan | hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
16:27 | kyan | Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
16:44 | chfoo | i think wget lua is patched to fix that problem; the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
16:47 | chfoo | actually.. i just tried out your command with normal wget and my warc file contains content
16:49 | chfoo | how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped opening it on the first file
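One way to check that, assuming the file is named example.warc.gz (illustrative): zcat decompresses every gzip member in sequence, so it recovers the whole WARC even when a viewer stops after the first record.

    # total decompressed size across *all* gzip members
    zcat example.warc.gz | wc -c
    # peek at the first records (WARC headers are plain text)
    zcat example.warc.gz | head -40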
17:13 | bsmith094 | 7zip ultra compression rules, 10-12x
17:20 | kyan | chfoo: I did another test. Here are: the command I ran (a script), the output of the command to the terminal, and all the files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
17:25 | ersi | I'd say, assume as little as possible about wget-lua
17:25 | ersi | It's not almighty
17:25 | chfoo | kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
17:26 | kyan | chfoo: Huh. I'll try extracting it another way… weird
17:27 | kyan | Now it's working.
17:27 | * | kyan is thoroughly confused now
17:27 | kyan | I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
17:30 | chfoo | "warc.gz" is a misnomer
18:14 | yipdw | kyan: yes
18:14 | yipdw | kyan: oh wait, chfoo already answered that
18:16 | yipdw | kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
18:17 | yipdw | the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
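This property is easy to demonstrate with plain gzip (the file names here are illustrative): concatenated gzip members form one valid .gz stream, which is what lets a crawler compress and append one record at a time.

    # compress two "records" independently and append them to one file
    echo 'record 1' | gzip >> demo.gz
    echo 'record 2' | gzip >> demo.gz
    # zcat walks every member, so both records come back out
    zcat demo.gz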
18:18 | kyan | yipdw: I see, thanks. That makes sense.
18:23 | kyan | Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
18:29 | chfoo | kyan: it's on the "lua" branch if you haven't seen it yet
18:31 | chfoo | alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
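A sketch of that manual truncation, with the scratch file name assumed from the earlier example; note that wget keeps its write offset on the open file, so after truncation the file typically becomes sparse rather than appearing smaller to wget:

    # while `wget -O scratch.tmp ...` is running in another terminal,
    # release the disk blocks it has written so far
    truncate -s 0 scratch.tmp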
18:31 | kyan | chfoo: oh thanks, sorry, I was looking at master
18:32 | chfoo | i'll fix the wiki and default branch since i made the same mistake a few times
18:33 | kyan | chfoo: Disk space hasn't been an issue… just the long file name issue.
18:36 | chfoo | kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
18:38 | kyan | chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
20:05 | Sellyme | Oh hey, my upload speed just tripled. Sweet.
20:36 | robv | so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
20:39 | ersi_ | We got a project up and running for blip.tv. Feel free to contribute
20:44 | robv | got two warrior instances up and running
20:50 | * | closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
20:58 | godane | you have a git version of healthcare.gov?
20:58 | yipdw | wait, that was on github at one point?
21:39 | omf_ | closure, put it up on IA, I have people asking me for it
22:04 | andy0 | is archivebot a thing to run at this time?
22:10 | yipdw | it's online, yes
22:43 | closure | omf_: uploading now
22:51 | closure | http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
23:40 | closure | anyone seen this before? https://conservatory.github.io/
23:41 | closure | (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
23:43 | * | closure kicks himself for not having run github-backup in that repo
23:50 | balrog | which repo?