[00:03] Sean Carruthers: http://www.flickr.com/photos/globalhermit/
[01:14] http://tag3ulp55xczs3pn.onion needs to be archived
[01:14] someone spammed the forum with a ton of crap ;/
[01:15] but the data *is* still there
[04:35] balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
[16:06] Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
[16:07] kyan, nope
[16:08] omf_: is there any solution to that problem, other than manually grabbing every URL?
[16:08] Not that I know of
[16:11] omf_: afaik wget+warc will write the right thing into the warc file
[16:11] i thought the wget-lua version we use saves it directly, with just a temp file to be used
[16:11] using the --output-document and --truncate-output options
[16:12] Try it out and see if it works
[16:14] hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
[16:27] Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
[16:44] i think wget-lua is patched to fix that problem; the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
[16:47] actually.. i just tried out your command with normal wget and my warc file contains content
[16:49] how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped opening it at the first file
[17:13] 7zip ultra compression rules, 10-12x
[17:20] chfoo: I did another test. Here is: the command I ran (a script), the output of the command to the terminal, and all files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
[17:25] I'd say, assume as little as possible about wget-lua
[17:25] It's not almighty
[17:25] kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
[17:26] chfoo: Huh. I'll try extracting it another way… weird
[17:27] Now it's working.
[17:27] * kyan is thoroughly confused now
[17:27] I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
[17:30] "warc.gz" is a misnomer
[18:14] kyan: yes
[18:14] kyan: oh wait, chfoo already answered that
[18:16] kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
[18:17] the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
[18:18] yipdw: I see, thanks. That makes sense.
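A minimal sketch of the point above, assuming a hypothetical file named mirror.warc.gz: each WARC record is its own gzip member, and concatenated gzip members still form a valid gzip stream, so stream decompressors recover the whole file, while a tool that stops after the first member appears to truncate it.

    # gunzip/zcat read all concatenated gzip members, so the full WARC comes out:
    gunzip -c mirror.warc.gz > mirror.warc
    # a utility that decompresses only the first gzip member stops after record one,
    # which is why some graphical archive tools seem to truncate warc.gz files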
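For reference, a sketch of the invocation under discussion, with a hypothetical URL and file names. --warc-file and --output-document are standard wget options (WARC support landed in 1.14); --truncate-output exists only in the wget-lua fork, so plain wget will reject it:

    # writes all responses into mirror.warc.gz; the payload goes to one temp file
    # that --truncate-output (wget-lua only) overwrites instead of appending to
    wget "http://example.com/some/very/long/url" \
      --warc-file=mirror \
      --output-document=tmp.dat \
      --truncate-output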
[18:23] Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
[18:29] kyan: it's on the "lua" branch if you haven't seen it yet
[18:31] alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
[18:31] chfoo: oh thanks, sorry, I was looking at master
[18:32] i'll fix the wiki and default branch since i made the same mistake a few times
[18:33] chfoo: Disk space hasn't been an issue… just the long file name issue.
[18:36] kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
[18:38] chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
[20:05] Oh hey, my upload speed just tripled. Sweet.
[20:36] so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
[20:39] We got a project up and running for blip.tv. Feel free to contribute
[20:44] got two warrior instances up and running
[20:50] * closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
[20:58] you have a git version of healthcare.gov?
[20:58] wait, that was on github at one point?
[21:39] closure, put it up on IA, I have people asking me for it
[22:04] is archivebot a thing to run at this time?
[22:10] it's online, yes
[22:43] omf_: uploading now
[22:51] http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
[23:40] anyone seen this before? https://conservatory.github.io/
[23:41] (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
[23:43] * closure kicks himself for not having run github-backup in that repo
[23:50] which repo?
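A sketch of one way to preserve a git repository as a single file for upload, assuming a local clone still exists (all paths and names here are hypothetical, not what closure actually ran): git bundle packs every ref and its full history into one file.

    # --mirror copies all refs, not just the checked-out branch
    git clone --mirror /path/to/healthcare.gov local-mirror.git
    # pack the whole mirror into a single uploadable file
    git -C local-mirror.git bundle create healthcare-gov.bundle --all
    # the bundle can later be restored with: git clone healthcare-gov.bundle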