Time | Nickname | Message
02:48
🔗
|
omf_ |
frontaalnaakt.nl is almost done uploading. Another site saved from religion |
07:11
🔗
|
PepsiMax |
omf_: as a Dutchie: risky click. |
07:17
🔗
|
newbie13 |
damn the DB zip files still don't work :(
07:43
🔗
|
Nemo_bis |
Hm, this is not working that well, is it? http://dsss.be/newegg-hard-drive-cost/ |
10:58
🔗
|
SketchCow |
http://www.edwardbetts.com/price_per_tb/ is what I use |
11:10
🔗
|
godane |
so i got most of the web only towel talk from techtv |
11:10
🔗
|
godane |
thanks to this: http://web.archive.org/web/20030210160905/http://www.techtv.com/screensavers/aboutus/story/0,24330,3402140,00.html |
11:11
🔗
|
godane |
and yes they were interviews with a towel
11:11
🔗
|
godane |
but bad news is the Patrick Norton interview has no audio from what i can tell
15:29
🔗
|
SketchCow |
FOS is going read-only. |
15:29
🔗
|
SketchCow |
We're getting a new machine! |
15:30
🔗
|
SketchCow |
Now, what to name it |
15:32
🔗
|
SketchCow |
Nailed |
15:32
🔗
|
SketchCow |
Honeycomb Hideout |
15:48
🔗
|
Smiley |
D: |
15:48
🔗
|
Smiley |
"FUCK YOU". |
15:49
🔗
|
Smiley |
Fata than foo. |
16:25
🔗
|
godane |
SketchCow: so FOS almost died?
16:26
🔗
|
godane |
must be trying to mirror everything to archive.org as fast as possible then
18:02
🔗
|
Smiley |
godane: it's not that, it's drive failures or possibly faulty hardware faking drive failures. |
18:15
🔗
|
godane |
ok |
18:15
🔗
|
godane |
but still |
18:15
🔗
|
godane |
mirror it to IA |
18:36
🔗
|
SketchCow |
omf_: Your grabs of sites are not working, and are not deriving. |
18:36
🔗
|
SketchCow |
glitch.com, rogerebert.com and gamasutra.com have not worked. |
18:37
🔗
|
Smiley |
1D: |
18:37
🔗
|
Smiley |
wtf |
18:39
🔗
|
omf_ |
All I did was use wget 1.14 and those sites probably have link cancer in them |
18:40
🔗
|
SketchCow |
http://www-tracey.us.archive.org/log_show.php?task_id=153967762 |
18:42
🔗
|
DFJustin |
that link isn't public, try https://www.us.archive.org/log_show.php?task_id=153967762 |
18:42
🔗
|
omf_ |
It worked for me, probably because I am logged in |
18:45
🔗
|
omf_ |
Is the derive code online? I checked https://github.com/internetarchive/ but couldn't find a project for it |
18:49
🔗
|
omf_ |
Is this https://github.com/internetarchive/CDX-Writer up to date? |
18:50
🔗
|
DFJustin |
https://github.com/rajbot/CDX-Writer looks newer |
18:51
🔗
|
Smiley |
Adding to fail-reasons list: CDXIndex:gzip fail:gamasutra.warc.gz ... |
19:02
🔗
|
omf_ |
Okay so I am trying to run cdx_writer.py from https://github.com/rajbot/CDX-Writer to see if I can get some more information locally. |
19:02
🔗
|
omf_ |
The problem is there are no docs for how to do this. So poking around I find I need this dependency https://bitbucket.org/rajbot/warc-tools/overview but when I git clone it, there is a server error |
19:02
🔗
|
Smiley |
zlib.error: Error -3 while decompressing: incorrect header check :/ what does that even mean :< |
19:02
🔗
|
omf_ |
I know what that means |
19:03
🔗
|
omf_ |
warc.gz are a collection of warc records that are gz compressed |
19:03
🔗
|
omf_ |
now gz files can have multiple separate entries |
19:04
🔗
|
Smiley |
so do they need splitting up or something to fix? |
19:04
🔗
|
omf_ |
the last entry (I assume since I cannot get the tool running yet) is truncated and thus throwing the error. What I wonder is why there is no recovery for an issue like this when looking at the test suite shows there was some serious effort put in |
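What omf_ suspects here can be checked without getting cdx_writer running at all. A minimal sketch, assuming only Python's stdlib (the helper name is hypothetical, not part of any tool mentioned above): `gzip.open` walks all the concatenated members of a `.warc.gz`, and a final member cut off mid-write surfaces as `EOFError`.

```python
import gzip
import zlib

def warc_gz_intact(path, chunk=1 << 20):
    """Return True if every gzip member in a .warc.gz decompresses
    cleanly, False if the stream is truncated or corrupt."""
    try:
        with gzip.open(path, "rb") as f:
            # gzip.open reads through all concatenated members; a record
            # truncated mid-write raises EOFError, garbage raises OSError.
            while f.read(chunk):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False
```

Running a check like this before upload would catch the bad grabs locally instead of at derive time.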
19:07
🔗
|
alard |
I get gzip: rogerebert.com.warc.gz: decompression OK, trailing garbage ignored , so I guess there's something missing at the end. |
19:07
🔗
|
omf_ |
The specification for WARC itself has no mention of handling corrupt records, recovery, or anything dealing with broken files. {sigh} |
19:07
🔗
|
omf_ |
alard, That is my takeaway as well
19:08
🔗
|
alard |
Should that be included in the specification? |
19:10
🔗
|
omf_ |
Well if you look at the gzip spec they have language about checking for errors in compliance tests as well as data verification |
19:10
🔗
|
omf_ |
How tools should handle errors |
19:11
🔗
|
alard |
Is that rogerebert.com.warc.gz one warc or was it stitched together? |
19:11
🔗
|
omf_ |
None of them were stitched together. I just ran wget and uploaded them when wget finished
19:12
🔗
|
alard |
The log record at the end is missing in my uncompressed rogerebert.warc. |
19:13
🔗
|
omf_ |
So how do you fix that? We have an existing tool |
19:17
🔗
|
alard |
In this case you could keep everything until the last gzip/warc record. |
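alard's salvage idea, keeping everything up to the last complete gzip/warc record, can be sketched with stdlib `zlib` (hypothetical helper name; it reads the whole file into memory, so this is a sketch for modest files, not a production tool):

```python
import zlib

def salvage_offset(path):
    """Byte offset of the end of the last *complete* gzip member in a
    .warc.gz; truncating the file there drops only the broken tail."""
    with open(path, "rb") as f:
        data = f.read()
    good = 0
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip wrapper
        try:
            d.decompress(data[pos:])
        except zlib.error:
            break          # garbage where a member header should be
        if not d.eof:
            break          # file ended mid-member: the truncated record
        # unused_data is everything after the member that just finished
        pos = len(data) - len(d.unused_data)
        good = pos
    return good
```

Truncating the file at the returned offset (e.g. with `os.truncate`) keeps every intact record and discards only the incomplete one at the end.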
19:17
🔗
|
omf_ |
yep |
19:18
🔗
|
omf_ |
I am kinda surprised this has not come up before |
19:19
🔗
|
alard |
It has. We have unfinished warcs. The megawarc builder checks for this and puts those warcs in the tar file. |
19:20
🔗
|
omf_ |
But we don't have a way to fix this? |
19:20
🔗
|
alard |
But you shouldn't be doing it with every warc. That's strange. |
19:20
🔗
|
omf_ |
It was 3 out of like 20 so far
19:20
🔗
|
alard |
No. No script. |
19:20
🔗
|
omf_ |
and I got a few hundred more to upload |
19:21
🔗
|
alard |
This shouldn't happen if Wget works normally and doesn't exit halfway. |
19:22
🔗
|
omf_ |
I agree |
19:24
🔗
|
alard |
Is there a standalone, easy to run cdx generator? |
19:26
🔗
|
omf_ |
That is what I am looking for. I am thinking about opening some bug reports, see if I can help fix shit up |
19:27
🔗
|
omf_ |
I should have a way to check and fix warcs before they are even uploaded |
19:29
🔗
|
alard |
All warcs generated with Wgets older than the very very latest Wget-git (or Wget+Lua) are somewhat broken. |
19:29
🔗
|
alard |
It's just that most tools don't see it. |
19:30
🔗
|
omf_ |
chfoo mentioned that as well
19:39
🔗
|
alard |
This is from the header of the last record in the rogerebert warc: |
19:39
🔗
|
alard |
00000340 4c 4f da 02 00 00 58 58 58 58 58 58 58 58 58 58 |LO....XXXXXXXXXX| |
19:39
🔗
|
alard |
00000350 58 58 1f 8b 08 00 00 00 00 00 02 03 d4 bd eb 72 |XX.............r| |
19:39
🔗
|
alard |
00000360 1d 47 92 26 f8 5f 66 fd 0e e7 4f cb 34 66 cb 8c |.G.&._f...O.4f..| |
19:39
🔗
|
alard |
The X's are placeholders that Wget fills in after writing the whole record, so apparently it never got that far.
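That placeholder pattern is easy to scan for. A heuristic sketch (hypothetical helper; it assumes the broken record looks like alard's hex dump, a run of literal `X` bytes immediately before the next gzip magic, and a legitimate payload could in principle contain the same byte sequence):

```python
GZIP_MAGIC = b"\x1f\x8b\x08"

def unfilled_placeholders(path, run=8):
    """Offsets of gzip member headers preceded by a run of literal 'X'
    bytes -- the unfilled-placeholder pattern from the hex dump above."""
    with open(path, "rb") as f:
        data = f.read()
    hits = []
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        # flag this member if the bytes just before it are all 'X'
        if pos >= run and data[pos - run:pos] == b"X" * run:
            hits.append(pos)
        pos = data.find(GZIP_MAGIC, pos + 1)
    return hits
```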
19:42
🔗
|
omf_ |
I am going to put all my warc information online |
19:42
🔗
|
omf_ |
Is the wiki the best place? |
19:42
🔗
|
alard |
Yes, I think so. |
19:43
🔗
|
alard |
This gives you that final record: tail -c +1161896909 rogerebert.com.warc.gz | gunzip -c | less It's the Wget log, but incomplete. You should have gotten an error. Disk full? |
19:44
🔗
|
omf_ |
no idea |
19:53
🔗
|
omf_ |
Here is the bulk of my warc information - http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem |
19:54
🔗
|
omf_ |
I am going to add a file format section once I finish typing it up
20:05
🔗
|
chfoo |
what did i mention? wget makes duplicate record ids? |
20:05
🔗
|
omf_ |
yes |
20:07
🔗
|
chfoo |
while you're at it, can you add my warc tool to the wiki as well :) |
20:07
🔗
|
omf_ |
I just added a WARC format section |
20:08
🔗
|
omf_ |
no problem chazchaz |
20:08
🔗
|
omf_ |
I mean chfoo |
20:10
🔗
|
omf_ |
I know there are some details missing, feel free to add them in. I want this page to be the only thing someone has to read to master warc files |
20:13
🔗
|
alard |
chfoo: Duplicate record IDs? That's new for me. |
20:15
🔗
|
alard |
They're supposedly unique UUIDs. |
20:16
🔗
|
chfoo |
it generates two resource records, for MANIFEST.txt and wget_arguments.txt, but the id is the same
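The duplicate-id claim is simple to verify on an uncompressed warc. A naive sketch (hypothetical helper; it scans raw lines, so a record *body* that happened to contain a `WARC-Record-ID:` line would be miscounted):

```python
from collections import Counter

def duplicate_record_ids(path):
    """Return the WARC-Record-ID values that appear more than once in
    an uncompressed warc -- the wget bug chfoo describes."""
    ids = Counter()
    with open(path, "rb") as f:
        for line in f:
            if line.lower().startswith(b"warc-record-id:"):
                ids[line.split(b":", 1)[1].strip()] += 1
    return [rid for rid, n in ids.items() if n > 1]
```

Per the WARC spec the ids are supposed to be globally unique (wget uses UUID URNs), so this should always return an empty list for a well-formed file.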
20:22
🔗
|
SketchCow |
ONE TWO ARCHIVE TEAM MEMBER APPEARANCES ON CBC SHOW: http://www.cbc.ca/spark/episodes/2013/04/12/213-data-longevity-integrative-thinking-virtual-staging/ |
20:22
🔗
|
SketchCow |
Take that, world |
20:27
🔗
|
godane |
just passed 29k videos for g4video-web collection |
20:31
🔗
|
omf_ |
chfoo, I could not find any licensing info |
20:32
🔗
|
chfoo |
omf_: it's GPL v3 |
20:34
🔗
|
godane |
so i looked for the spark podcast in IA and it doesn't really exist there
20:34
🔗
|
alard |
https://github.com/alard/CDX-Writer/compare/ignore-invalid-gzip-headers |
20:35
🔗
|
godane |
with over 200+ episodes i think i will slowly start mirroring that |
20:40
🔗
|
omf_ |
alard, Wouldn't forking from https://github.com/rajbot/CDX-Writer be better, since the internetarchive one is an out-of-date fork of that?
20:40
🔗
|
omf_ |
Then again I do not know which version is used by IA at present |
20:42
🔗
|
SketchCow |
Oh my god, I want to punch the "right to forget" person in this blog |
21:18
🔗
|
godane |
so i have a little problem with cbc spark descs |
21:18
🔗
|
godane |
it has more then one line |
21:26
🔗
|
omf_ |
chfoo, alard Anything major about the warc and cdx file formats missing from the wiki? I am trying to make it a big checklist so a developer can follow it and work with warcs |
21:33
🔗
|
eadler |
SketchCow: which blog ? |
21:34
🔗
|
godane |
first episode of spark uploaded: https://archive.org/details/spark_20070905_3205 |
21:50
🔗
|
arkhive |
Did anyone ever save Minitel? (if it was savable) |
21:50
🔗
|
arkhive |
I asked this a month ago and never saw the answer because i disconnected. |
21:57
🔗
|
omf_ |
Thanks for the additions alard keep em rolling in :) |
22:02
🔗
|
omf_ |
I just listened to the cbc show. I agree SketchCow, the right to forget proponent is a fool
22:10
🔗
|
omf_ |
An international privacy expert who does not understand the web |
22:28
🔗
|
omf_ |
Okay I got 13 tools for dealing with warc files on here http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem . What tools are missing? What information would you like to see on there? |
22:29
🔗
|
omf_ |
Any other key metrics of the software we should be tracking? I got license, language, testing, docs, # of authors |
22:32
🔗
|
godane |
so you guys will soon have all episodes of cbc spark podcast for 2007
22:32
🔗
|
godane |
at least the ones i can find
22:32
🔗
|
godane |
episode 2 and 3 are gone i guess |
22:41
🔗
|
arkhive |
err.. connection messed up again |
22:44
🔗
|
dashcloud |
omf_: actually, I'd like to see an example that anyone could use to archive a site and make a WARC suitable for putting into the Wayback machine |