Time | Nickname | Message
00:03 | godane | Sean Carruthers: http://www.flickr.com/photos/globalhermit/
01:14 | balrog | http://tag3ulp55xczs3pn.onion needs to be archived
01:14 | balrog | someone spammed the forum with a ton of crap ;/
01:15 | balrog | but the data *is* still there
04:35 | kyan | balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
16:06 | kyan | Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
16:07 | omf_ | kyan, nope
16:08 | kyan | omf_: is there any solution to that problem, other than manually grabbing every URL?
16:08 | omf_ | Not that I know of
16:11 | balrog | omf_: afaik wget+warc will write the right thing into the warc file
16:11 | chfoo | i thought the wget lua version we use saves it directly, with just a temp file used in between
16:11 | chfoo | using the --output-document and --truncate-output options
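A minimal sketch of the invocation being described here, assuming a wget-lua build installed as wget-lua and an illustrative target URL; only --warc-file, --output-document, and --truncate-output come from the discussion, everything else is an assumption:

    # Write every response into example-site.warc.gz while sending all
    # fetched bodies to a single fixed-name scratch file, so long remote
    # paths never have to become local filenames.
    # --truncate-output is wget-lua-specific: it overwrites the scratch
    # file for each URL instead of appending to it.
    wget-lua --mirror \
      --warc-file=example-site \
      --output-document=scratch.tmp \
      --truncate-output \
      http://example.com/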
16:12 | omf_ | Try it out and see if it works
16:14 | kyan | hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
16:27 | kyan | Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
16:44 | chfoo | i think wget lua is patched to fix that problem; the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
16:47 | chfoo | actually.. i just tried out your command with normal wget and my warc file contains content
16:49 | chfoo | how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped opening it on the first file
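One way to check that, assuming the file is named example.warc.gz (illustrative): zcat decompresses every gzip member in sequence, so it recovers the whole WARC even when a viewer stops after the first record.

    # total decompressed size across *all* gzip members
    zcat example.warc.gz | wc -c
    # peek at the first records (WARC headers are plain text)
    zcat example.warc.gz | head -40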
17:13 | bsmith094 | 7zip ultra compression rules, 10-12x
17:20 | kyan | chfoo: I did another test. Here are: the command I ran (a script), the output of the command to the terminal, and all the files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
17:25 | ersi | I'd say, assume as little as possible about wget-lua
17:25 | ersi | It's not almighty
17:25 | chfoo | kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
17:26 | kyan | chfoo: Huh. I'll try extracting it another way… weird
17:27 | kyan | Now it's working.
17:27 | * | kyan is thoroughly confused now
17:27 | kyan | I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
17:30 | chfoo | "warc.gz" is a misnomer
18:14 | yipdw | kyan: yes
18:14 | yipdw | kyan: oh wait, chfoo already answered that
18:16 | yipdw | kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
18:17 | yipdw | the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
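This property is easy to demonstrate with plain gzip (the file names here are illustrative): concatenated gzip members form one valid .gz stream, which is what lets a crawler compress and append one record at a time.

    # compress two "records" independently and append them to one file
    echo 'record 1' | gzip >> demo.gz
    echo 'record 2' | gzip >> demo.gz
    # zcat walks every member, so both records come back out
    zcat demo.gz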
18:18 | kyan | yipdw: I see, thanks. That makes sense.
18:23 | kyan | Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
18:29 | chfoo | kyan: it's on the "lua" branch if you haven't seen it yet
18:31 | chfoo | alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
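A sketch of that manual truncation, with the scratch file name assumed from the earlier example; note that wget keeps its write offset on the open file, so after truncation the file typically becomes sparse rather than appearing smaller to wget:

    # while `wget -O scratch.tmp ...` is running in another terminal,
    # release the disk blocks it has written so far
    truncate -s 0 scratch.tmp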
18:31 | kyan | chfoo: oh thanks, sorry, I was looking at master
18:32 | chfoo | i'll fix the wiki and default branch since i made the same mistake a few times
18:33 | kyan | chfoo: Disk space hasn't been an issue… just the long file name issue.
18:36 | chfoo | kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
18:38 | kyan | chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
20:05 | Sellyme | Oh hey, my upload speed just tripled. Sweet.
20:36 | robv | so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
20:39 | ersi_ | We got a project up and running for blip.tv. Feel free to contribute
20:44 | robv | got two warrior instances up and running
20:50 | * | closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
20:58 | godane | you have a git version of healthcare.gov?
20:58 | yipdw | wait, that was on github at one point?
21:39 | omf_ | closure, put it up on IA, I have people asking me for it
22:04 | andy0 | is archivebot a thing to run at this time?
22:10 | yipdw | it's online, yes
22:43 | closure | omf_: uploading now
22:51 | closure | http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
23:40 | closure | anyone seen this before? https://conservatory.github.io/
23:41 | closure | (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
23:43 | * | closure kicks himself for not having run github-backup in that repo
23:50 | balrog | which repo?