00:03 <godane> Sean Carruthers: http://www.flickr.com/photos/globalhermit/
01:14 <balrog> http://tag3ulp55xczs3pn.onion needs to be archived
01:14 <balrog> someone spammed the forum with a ton of crap ;/
01:15 <balrog> but the data *is* still there
04:35 <kyan> balrog: I grabbed the Requiem website: https://archive.org/details/RequiemArchiveBegun11Oct2013Meta.warc
16:06 <kyan> Is there any way to direct wget+WARC to download directly into the WARC instead of saving the file? I've been getting a "File name too long" error when running some mirrors.
16:07 <omf_> kyan, nope
16:08 <kyan> omf_: is there any solution to that problem, other than manually grabbing every URL?
16:08 <omf_> Not that I know of
16:11 <balrog> omf_: afaik wget+warc will write the right thing into the warc file
16:11 <chfoo> i thought the wget lua version we use saves it directly, with just a temp file being used
16:11 <chfoo> using the --output-document and --truncate-output options
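
A sketch of the invocation being discussed, assuming the wget-lua fork that carries the --truncate-output patch (the URL and file names here are placeholders):

    # site.warc.gz receives the WARC records; tmp.bin is a single scratch file,
    # so no per-URL file names ever touch the filesystem
    wget-lua --recursive \
        --warc-file=site \
        --output-document=tmp.bin \
        --truncate-output \
        http://example.com/
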
16:12 <omf_> Try it out and see if it works
16:14 <kyan> hmm. I'm using the built-in WARC output in wget 1.14. I'll do some tests…
16:27 <kyan> Interesting… with --output-document, the file gets downloaded correctly, but the WARC file itself is missing the content: http://pastebin.com/npzP1c7v
16:44 <chfoo> i think wget lua is patched to fix that problem, the --truncate-output option modifies --output-document so it doesn't append to the existing file but just overwrites it as it goes
16:47 <chfoo> actually.. i just tried out your command with normal wget and my warc file contains content
16:49 <chfoo> how are you looking at the warc file? what you pasted isn't the complete warc. since each record is a separate gzip file, it might have stopped opening it on the first file
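
Since gzip decompresses concatenated members as one continuous stream, the command line reads a multi-member .warc.gz in full; a quick way to confirm all the records are present (file name is a placeholder):

    # zcat walks every gzip member in sequence; count the record headers
    zcat site.warc.gz | grep -ac '^WARC/1.0'
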
17:13 <bsmith094> 7zip ultra compression rules, 10-12x
17:20 <kyan> chfoo: I did another test. Here are: the command I ran (a script), the command's terminal output, and all files that wget put in the folder: http://futuramerlin.com/Long-URL-test-12October2013.tar.gz
17:25 <ersi> I'd say, assume as little as possible about wget-lua
17:25 <ersi> It's not almighty
17:25 <chfoo> kyan: the warc file should be complete. inside it, i see the 302 redirects, the page's html, and the log file.
17:26 <kyan> chfoo: Huh. I'll try extracting it another way… weird
17:27 <kyan> Now it's working.
17:27 * kyan is thoroughly confused now
17:27 <kyan> I used gunzip to extract it and it was fine… using graphical archive utilities truncated the file
17:30 <chfoo> "warc.gz" is a misnomer
18:14 <yipdw> kyan: yes
18:14 <yipdw> kyan: oh wait, chfoo already answered that
18:16 <yipdw> kyan: for future reference, a gzipped WARC is a sequence of individually gzipped WARC records; this is legal, but a lot of utilities will bork it
18:17 <yipdw> the reason why that's done is for read/write efficiency: it's much easier to seek to a record that way than if you gzipped everything, and it's much more efficient to compress+append when generating a WARC from a network stream
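
That compress+append pattern is easy to demonstrate from the shell: each record is gzipped on its own and the concatenation is still one valid gzip stream (file names here are hypothetical):

    gzip -c record1.warc >> grab.warc.gz   # compress and append one record
    gzip -c record2.warc >> grab.warc.gz   # ...then the next
    zcat grab.warc.gz | head               # both members decompress in order
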
18:18 <kyan> yipdw: I see, thanks. That makes sense.
18:23 <kyan> Hmm. I tried --truncate-output with the latest wget-lua from github, but it says it's an unrecognized option…
18:29 <chfoo> kyan: it's on the "lua" branch if you haven't seen it yet
18:31 <chfoo> alternatively, you can just use normal wget and manually truncate the output file if you start running out of disk space
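
A sketch of that manual approach with stock wget, assuming coreutils truncate (names are placeholders; the scratch file goes sparse rather than resetting wget's write offset):

    # mirror with WARC capture on, all response bodies funneled into one scratch file
    wget --mirror --warc-file=site --output-document=tmp.bin http://example.com/ &
    # later, reclaim disk space: the WARC already holds every response body
    truncate -s 0 tmp.bin
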
18:31 <kyan> chfoo: oh, thanks. sorry, I was looking at master
18:32 <chfoo> i'll fix the wiki and default branch since i made the same mistake a few times
18:33 <kyan> chfoo: Disk space hasn't been an issue… just the long file name issue.
18:36 <chfoo> kyan: oh, ok. the choice is up to you. wget-lua should be ok, there hasn't been an issue about it so far
18:38 <kyan> chfoo: Cool. Thanks. I'm just going to keep trying things until I arrive at what works :D
20:05 <Sellyme> Oh hey, my upload speed just tripled. Sweet.
20:36 <robv> so, blip's getting screwed over, huh? it's like viddler all over again with these content wipes
20:39 <ersi_> We got a project up and running for blip.tv. Feel free to contribute
20:44 <robv> got two warrior instances up and running
20:50 * closure notices he still has a git clone of healthcare.gov.. guess that puppy is going to archive.org now that they've nuked it from github..
20:58 <godane> you have a git version of healthcare.gov?
20:58 <yipdw> wait, that was on github at one point?
21:39 <omf_> closure, put it up on IA, I have people asking me for it
22:04 <andy0> is archivebot a thing to run at this time?
22:10 <yipdw> it's online, yes
22:43 <closure> omf_: uploading now
22:51 <closure> http://ia801006.us.archive.org/33/items/healthcare-gov-gitrepo/SHA1E-s4715982--dc21e50fd159fb228da9f8c06fecb6f2e0681575.gov.git
23:40 <closure> anyone seen this before? https://conservatory.github.io/
23:41 <closure> (and no, I don't mean the broken ssl cert for github.io, although that's pretty funny)
23:43 * closure kicks himself for not having run github-backup in that repo
23:50 <balrog> which repo?