[02:48] <omf_> frontaalnaakt.nl is almost done uploading. Another site saved from religion
[07:11] <PepsiMax> omf_: as a Dutchie: risky click.
[07:17] <newbie13> damn the DB zip files still dont work :(
[07:43] <Nemo_bis> Hm, this is not working that well, is it? http://dsss.be/newegg-hard-drive-cost/
[10:58] <SketchCow> http://www.edwardbetts.com/price_per_tb/ is what I use
[11:10] <godane> so i got most of the web only towel talk from techtv
[11:10] <godane> thanks to this: http://web.archive.org/web/20030210160905/http://www.techtv.com/screensavers/aboutus/story/0,24330,3402140,00.html
[11:11] <godane> and yes they were interview with a towel
[11:11] <godane> but bad news is patrick nortan interview has no audio from what i can tell
[15:29] <SketchCow> FOS is going read-only.
[15:29] <SketchCow> We're getting a new machine!
[15:30] <SketchCow> Now, what to name it
[15:32] <SketchCow> Nailed
[15:32] <SketchCow> Honeycomb Hideout
[15:48] <Smiley> D:
[15:48] <Smiley> "FUCK YOU".
[15:49] <Smiley> Fata than foo.
[16:25] <godane> SketchCow: so FOS is almost died?
[16:26] <godane> must be trying to mirror everything to archive.org has fast has possible then
[18:02] <Smiley> godane: it's not that, it's drive failures or possibly faulty hardware faking drive failures.
[18:15] <godane> ok
[18:15] <godane> but still
[18:15] <godane> mirror it to IA
[18:36] <SketchCow> omf_: Your grabs of sites are not working, and are not deriving.
[18:36] <SketchCow> glitch.com, rogerebert.com and gamasutra.com have not worked.
[18:37] <Smiley> 1D:
[18:37] <Smiley> wtf
[18:39] <omf_> All I did was use wget 1.14 and those sites probably have link cancer in them
[18:40] <SketchCow> http://www-tracey.us.archive.org/log_show.php?task_id=153967762
[18:42] <DFJustin> that link isn't public, try https://www.us.archive.org/log_show.php?task_id=153967762
[18:42] <omf_> It worked for me, probably because I am logged in
[18:45] <omf_> Is the derive code online? I checked https://github.com/internetarchive/ but couldn't find a project for it
[18:49] <omf_> Is this https://github.com/internetarchive/CDX-Writer up to date?
[18:50] <DFJustin> https://github.com/rajbot/CDX-Writer looks newer
[18:51] <Smiley> Adding to fail-reasons list: CDXIndex:gzip fail:gamasutra.warc.gz ...
[19:02] <omf_> Okay so I am trying to run cdx_writer.py from https://github.com/rajbot/CDX-Writer to see if I can get some more information locally.
[19:02] <omf_> The problem is there are no docs for how to do this. So poking around I find I need this dependency https://bitbucket.org/rajbot/warc-tools/overview but when I git clone it, there is a server error
[19:02] <Smiley> zlib.error: Error -3 while decompressing: incorrect header check :/ what does that even mean :<
[19:02] <omf_> I know what that means
[19:03] <omf_> warc.gz are a collection of warc records that are gz compressed
[19:03] <omf_> now gz files can have multiple separate entries
[19:04] <Smiley> so do they need splitting up or something to fix?
[19:04] <omf_> the last entry (I assume since I cannot get the tool running yet) is truncated and thus throwing the error. What I wonder is why there is no recovery for an issue like this when looking at the test suite shows there was some serious effort put in
[19:07] <alard> I get gzip: rogerebert.com.warc.gz: decompression OK, trailing garbage ignored , so I guess there's something missing at the end.
[19:07] <omf_> The specification for WARC itself has no mention of handling corrupt records, recovery, or anything dealing with broken files. {sigh}
[19:07] <omf_> alard, That is my take away as well
[19:08] <alard> Should that be included in the specification?
[19:10] <omf_> Well if you look at the gzip spec they have language about checking for errors in compliance tests as well as data verification
[19:10] <omf_> How tools should handle errors
[19:11] <alard> Is that rogerebert.com.warc.gz one warc or was it stitched together?
[19:11] <omf_> None of them were stiched together. I just ran wget and uploaded them when wget finished
[19:12] <alard> The log record at the end is missing in my uncompressed rogerebert.warc.
[19:13] <omf_> So how do you fix that? We have an existing tool
[19:17] <alard> In this case you could keep everything until the last gzip/warc record.
[19:17] <omf_> yep
[19:18] <omf_> I am kinda surprised this has not come up before
[19:19] <alard> It has. We have unfinished warcs. The megawarc builder checks for this and puts those warcs in the tar file.
[19:20] <omf_> But we don't have a way to fix this?
[19:20] <alard> But you shouldn't be doing it with every warc. That's strange.
[19:20] <omf_> It was 3 out like 20 so far
[19:20] <alard> No. No script.
[19:20] <omf_> and I got a few hundred more to upload
[19:21] <alard> This shouldn't happen if Wget works normally and doesn't exit halfway.
[19:22] <omf_> I agree
[19:24] <alard> Is there a standalone, easy to run cdx generator?
[19:26] <omf_> That is what I am looking for. I am thinking about opening some bug reports, see if I can help fix shit up
[19:27] <omf_> I should have a way to check and fix warcs before they are even uploaded
[19:29] <alard> All warcs generated with Wgets older than the very very latest Wget-git (or Wget+Lua) are somewhat broken.
[19:29] <alard> It's just that most tools don't see it.
[19:30] <omf_> chfoo, Mentioned that as well
[19:39] <alard> This is from the header of the last record in the rogerebert warc:
[19:39] <alard> 00000340  4c 4f da 02 00 00 58 58  58 58 58 58 58 58 58 58  |LO....XXXXXXXXXX|
[19:39] <alard> 00000350  58 58 1f 8b 08 00 00 00  00 00 02 03 d4 bd eb 72  |XX.............r|
[19:39] <alard> 00000360  1d 47 92 26 f8 5f 66 fd  0e e7 4f cb 34 66 cb 8c  |.G.&._f...O.4f..|
[19:39] <alard> The X's are placeholders that Wget fills up after writing the whole record, so apparently it never got that far.
[19:42] <omf_> I am going to put all my warc information online
[19:42] <omf_> Is the wiki the best place?
[19:42] <alard> Yes, I think so.
[19:43] <alard> This gives you that final record:  tail -c +1161896909 rogerebert.com.warc.gz | gunzip -c | less   It's the Wget log, but incomplete. You should have gotten an error. Disk full?
[19:44] <omf_> no idea
[19:53] <omf_> Here is the bulk of my warc information - http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[19:54] <omf_> I am going to add a file formation section once I finish typing it up
[20:05] <chfoo> what did i mention? wget makes duplicate record ids?
[20:05] <omf_> yes
[20:07] <chfoo> while you're at it, can you add my warc tool to the wiki as well :)
[20:07] <omf_> I just added a WARC format section
[20:08] <omf_> no problem chazchaz
[20:08] <omf_> I mean chfoo
[20:10] <omf_> I know there are some details missing, feel free to add them in. I want this page to be the only thing someone has to read to master warc files
[20:13] <alard> chfoo: Duplicate record IDs? That's new for me.
[20:15] <alard> They're supposedly unique UUIDs.
[20:16] <chfoo> it generates two resources records for MANIFEST.txt and wget_arguments.txt but the id is the same
[20:22] <SketchCow> ONE TWO ARCHIVE TEAM MEMBER APPEARANCES ON CBC SHOW: http://www.cbc.ca/spark/episodes/2013/04/12/213-data-longevity-integrative-thinking-virtual-staging/
[20:22] <SketchCow> Take that, world
[20:27] <godane> just passed 29k videos for g4video-web collection
[20:31] <omf_> chfoo, I could not find any licensing info
[20:32] <chfoo> omf_: it's GPL v3
[20:34] <godane> so i look for spark podcast in IA and it doesn't really exist in
[20:34] <alard> https://github.com/alard/CDX-Writer/compare/ignore-invalid-gzip-headers
[20:35] <godane> with over 200+ episodes i think i will slowly start mirroring that
[20:40] <omf_> alard, Wouldn't forking from https://github.com/rajbot/CDX-Writer be better since the internetarchive one is a out of date fork of that?
[20:40] <omf_> Then again I do not know which version is used by IA at present
[20:42] <SketchCow> Oh my god, I want to punch the "right to forget" person in this blog
[21:18] <godane> so i have a little problem with cbc spark descs
[21:18] <godane> it has more then one line
[21:26] <omf_> chfoo, alard Anything major about the warc and cdx file formats missing from the wiki? I am trying to make it a big checklist so a developer can follow it and work with warcs
[21:33] <eadler> SketchCow: which blog ?
[21:34] <godane> first episode of spark uploaded: https://archive.org/details/spark_20070905_3205
[21:50] <arkhive> Did anyone ever save Minitel? (if it was savable)
[21:50] <arkhive> I asked this a month ago and never saw the answer because i disconnected.
[21:57] <omf_> Thanks for the additions alard keep em rolling in :)
[22:02] <omf_> I just listened to cbc show. I agree SketchCow the right to forget proponent is a fool
[22:10] <omf_> An international privacy expert who does not understand the web
[22:28] <omf_> Okay I got 13 tools for dealing with warc files on here http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem . What tools are missing? What information would you like to see on there?
[22:29] <omf_> Any other key metrics of the software we should be tracking? I got license, language, testing, docs, # of authors
[22:32] <godane> so you guys will soon have all epsidoes of cbc spark podcast for 2007
[22:32] <godane> at least the one i can find
[22:32] <godane> episode 2 and 3 are gone i guess
[22:41] <arkhive> err.. connection messed up again
[22:44] <dashcloud> omf_: actually, I'd like to see an example that anyone could use to archive a site and make a WARC suitable for putting into the Wayback machine