[00:03] Sean Carruthers: http://www.flickr.com/photos/globalhermit/
[01:14] http://tag3ulp55xczs3pn.onion needs to be archived
[01:14] someone spammed the forum with a ton of crap ;/
[01:15] but the data *is* still there
[04:35] balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
[16:06] Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
[16:07] kyan, nope
[16:08] omf_: is there any solution to that problem, other than manually grabbing every URL?
[16:08] Not that I know of
[16:11] omf_: afaik wget+warc will write the right thing into the warc file
[16:11] i thought the wget-lua version we use saves it directly, with just a temp file to be used
[16:11] using the --output-document and --truncate-output options
[16:12] Try it out and see if it works
[16:14] hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
[16:27] Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
[16:44] i think wget-lua is patched to fix that problem; the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
[16:47] actually.. i just tried out your command with normal wget and my warc file contains content
[16:49] how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped opening it at the first file
[17:13] 7zip ultra compression rules, 10-12x
[17:20] chfoo: I did another test. Here is: the command I ran (a script), the output of the command to the terminal, and all files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
[17:25] I'd say, assume as little as possible about wget-lua
[17:25] It's not almighty
[17:25] kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
[17:26] chfoo: Huh. I'll try extracting it another way… weird
[17:27] Now it's working.
[17:27] * kyan is thoroughly confused now
[17:27] I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
[17:30] "warc.gz" is a misnomer
[18:14] kyan: yes
[18:14] kyan: oh wait, chfoo already answered that
[18:16] kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
[18:17] the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
[18:18] yipdw: I see, thanks. That makes sense.
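A minimal sketch of the point above, assuming a hypothetical file named mirror.warc.gz: each WARC record is its own gzip member, and concatenated gzip members still form a valid gzip stream, so stream decompressors recover the whole file, while a tool that stops after the first member appears to truncate it.

    # gunzip/zcat read all concatenated gzip members, so the full WARC comes out:
    gunzip -c mirror.warc.gz > mirror.warc
    # a utility that decompresses only the first gzip member stops after record one,
    # which is why some graphical archive tools seem to truncate warc.gz files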
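For reference, a sketch of the invocation under discussion, with a hypothetical URL and file names. --warc-file and --output-document are standard wget options (WARC support landed in 1.14); --truncate-output exists only in the wget-lua fork, so plain wget will reject it:

    # writes all responses into mirror.warc.gz; the payload goes to one temp file
    # that --truncate-output (wget-lua only) overwrites instead of appending to
    wget "http://example.com/some/very/long/url" \
      --warc-file=mirror \
      --output-document=tmp.dat \
      --truncate-output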
[18:23] Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
[18:29] kyan: it's on the "lua" branch if you haven't seen it yet
[18:31] alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
[18:31] chfoo: oh thanks, sorry, I was looking at master
[18:32] i'll fix the wiki and default branch since i made the same mistake a few times
[18:33] chfoo: Disk space hasn't been an issue… just the long file name issue.
[18:36] kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
[18:38] chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
[20:05] Oh hey, my upload speed just tripled. Sweet.
[20:36] so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
[20:39] We got a project up and running for blip.tv. Feel free to contribute
[20:44] got two warrior instances up and running
[20:50] * closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
[20:58] you have a git version of healthcare.gov?
[20:58] wait, that was on github at one point?
[21:39] closure, put it up on IA, I have people asking me for it
[22:04] is archivebot a thing to run at this time?
[22:10] it's online, yes
[22:43] omf_: uploading now
[22:51] http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
[23:40] anyone seen this before? https://conservatory.github.io/
[23:41] (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
[23:43] * closure kicks himself for not having run github-backup in that repo
[23:50] which repo?
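A sketch of one way to preserve a git repository as a single file for upload, assuming a local clone still exists (all paths and names here are hypothetical, not what closure actually ran): git bundle packs every ref and its full history into one file.

    # --mirror copies all refs, not just the checked-out branch
    git clone --mirror /path/to/healthcare.gov local-mirror.git
    # pack the whole mirror into a single uploadable file
    git -C local-mirror.git bundle create healthcare-gov.bundle --all
    # the bundle can later be restored with: git clone healthcare-gov.bundle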