[02:48] frontaalnaakt.nl is almost done uploading. Another site saved from religion
[07:11] omf_: as a Dutchie: risky click.
[07:17] damn, the DB zip files still don't work :(
[07:43] Hm, this is not working that well, is it? http://dsss.be/newegg-hard-drive-cost/
[10:58] http://www.edwardbetts.com/price_per_tb/ is what I use
[11:10] so I got most of the web-only Towel Talk from TechTV
[11:10] thanks to this: http://web.archive.org/web/20030210160905/http://www.techtv.com/screensavers/aboutus/story/0,24330,3402140,00.html
[11:11] and yes, they were interviews with a towel
[11:11] but the bad news is the Patrick Norton interview has no audio, from what I can tell
[15:29] FOS is going read-only.
[15:29] We're getting a new machine!
[15:30] Now, what to name it?
[15:32] Nailed
[15:32] Honeycomb Hideout
[15:48] D:
[15:48] "FUCK YOU".
[15:49] Fata than foo.
[16:25] SketchCow: so FOS has almost died?
[16:26] must be trying to mirror everything to archive.org as fast as possible then
[18:02] godane: it's not that, it's drive failures or possibly faulty hardware faking drive failures.
[18:15] ok
[18:15] but still
[18:15] mirror it to IA
[18:36] omf_: Your grabs of sites are not working, and are not deriving.
[18:36] glitch.com, rogerebert.com and gamasutra.com have not worked.
[18:37] D:
[18:37] wtf
[18:39] All I did was use wget 1.14, and those sites probably have link cancer in them
[18:40] http://www-tracey.us.archive.org/log_show.php?task_id=153967762
[18:42] that link isn't public, try https://www.us.archive.org/log_show.php?task_id=153967762
[18:42] It worked for me, probably because I am logged in
[18:45] Is the derive code online? I checked https://github.com/internetarchive/ but couldn't find a project for it
[18:49] Is this https://github.com/internetarchive/CDX-Writer up to date?
[18:50] https://github.com/rajbot/CDX-Writer looks newer
[18:51] Adding to the fail-reasons list: CDXIndex:gzip fail:gamasutra.warc.gz ...
[19:02] Okay, so I am trying to run cdx_writer.py from https://github.com/rajbot/CDX-Writer to see if I can get some more information locally.
[19:02] The problem is there are no docs for how to do this. Poking around, I find I need this dependency, https://bitbucket.org/rajbot/warc-tools/overview, but when I git clone it there is a server error
[19:02] zlib.error: Error -3 while decompressing: incorrect header check :/ what does that even mean :<
[19:02] I know what that means
[19:03] a warc.gz is a collection of warc records that are individually gzip-compressed
[19:03] now, gz files can have multiple separate entries
[19:04] so do they need splitting up or something to fix?
[19:04] the last entry (I assume, since I cannot get the tool running yet) is truncated and thus throwing the error. What I wonder is why there is no recovery for an issue like this, when the test suite shows there was serious effort put in
[19:07] I get gzip: rogerebert.com.warc.gz: decompression OK, trailing garbage ignored, so I guess there's something missing at the end.
[19:07] The specification for WARC itself has no mention of handling corrupt records, recovery, or anything dealing with broken files. {sigh}
[19:07] alard, that is my takeaway as well
[19:08] Should that be included in the specification?
[19:10] Well, if you look at the gzip spec, it has language about checking for errors in compliance tests as well as data verification
[19:10] How tools should handle errors
[19:11] Is that rogerebert.com.warc.gz one warc, or was it stitched together?
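A minimal sketch of the check being described here: since a warc.gz is a series of concatenated gzip members, you can walk the members with Python's zlib and see whether the last one decompresses to completion. This is an illustration, not one of the tools named above; check_warc_gz is a hypothetical name, and it reads the whole file into memory, so a real tool would stream instead.

```python
import sys
import zlib

def check_warc_gz(path):
    """Return the byte offset of the first incomplete or corrupt gzip
    member, or None if every member decompresses cleanly."""
    with open(path, 'rb') as f:
        data = f.read()  # fine for a sketch; stream for multi-GB warcs
    offset = 0
    while offset < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
        try:
            d.decompress(data[offset:])
        except zlib.error:   # e.g. "incorrect header check"
            return offset
        if not d.eof:        # input ran out mid-member: truncated
            return offset
        # the next member starts where this one's unused data begins
        offset = len(data) - len(d.unused_data)
    return None

if __name__ == '__main__':
    bad = check_warc_gz(sys.argv[1])
    print('OK' if bad is None else 'bad member at byte offset %d' % bad)
```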
[19:11] None of them were stitched together. I just ran wget and uploaded them when wget finished
[19:12] The log record at the end is missing in my uncompressed rogerebert.warc.
[19:13] So how do you fix that? Do we have an existing tool?
[19:17] In this case you could keep everything until the last gzip/warc record.
[19:17] yep
[19:18] I am kinda surprised this has not come up before
[19:19] It has. We have unfinished warcs. The megawarc builder checks for this and puts those warcs in the tar file.
[19:20] But we don't have a way to fix this?
[19:20] But it shouldn't be happening with every warc. That's strange.
[19:20] It was 3 out of like 20 so far
[19:20] No. No script.
[19:20] and I've got a few hundred more to upload
[19:21] This shouldn't happen if Wget works normally and doesn't exit halfway.
[19:22] I agree
[19:24] Is there a standalone, easy-to-run cdx generator?
[19:26] That is what I am looking for. I am thinking about opening some bug reports, to see if I can help fix shit up
[19:27] I should have a way to check and fix warcs before they are even uploaded
[19:29] All warcs generated with Wgets older than the very very latest Wget-git (or Wget+Lua) are somewhat broken.
[19:29] It's just that most tools don't see it.
[19:30] chfoo mentioned that as well
[19:39] This is from the header of the last record in the rogerebert warc:
[19:39] 00000340 4c 4f da 02 00 00 58 58 58 58 58 58 58 58 58 58 |LO....XXXXXXXXXX|
[19:39] 00000350 58 58 1f 8b 08 00 00 00 00 00 02 03 d4 bd eb 72 |XX.............r|
[19:39] 00000360 1d 47 92 26 f8 5f 66 fd 0e e7 4f cb 34 66 cb 8c |.G.&._f...O.4f..|
[19:39] The X's are placeholders that Wget fills in after writing the whole record, so apparently it never got that far.
[19:42] I am going to put all my warc information online
[19:42] Is the wiki the best place?
[19:42] Yes, I think so.
[19:43] This gives you that final record: tail -c +1161896909 rogerebert.com.warc.gz | gunzip -c | less (it's the Wget log, but incomplete). You should have gotten an error. Disk full?
[19:44] no idea
[19:53] Here is the bulk of my warc information - http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[19:54] I am going to add a file format section once I finish typing it up
[20:05] what did I mention? wget makes duplicate record IDs?
[20:05] yes
[20:07] while you're at it, can you add my warc tool to the wiki as well :)
[20:07] I just added a WARC format section
[20:08] no problem chazchaz
[20:08] I mean chfoo
[20:10] I know there are some details missing, feel free to add them in. I want this page to be the only thing someone has to read to master warc files
[20:13] chfoo: Duplicate record IDs? That's news to me.
[20:15] They're supposed to be unique UUIDs.
[20:16] it generates two resource records, for MANIFEST.txt and wget_arguments.txt, but the ID is the same
[20:22] ONE TWO ARCHIVE TEAM MEMBER APPEARANCES ON CBC SHOW: http://www.cbc.ca/spark/episodes/2013/04/12/213-data-longevity-integrative-thinking-virtual-staging/
[20:22] Take that, world
[20:27] just passed 29k videos for the g4video-web collection
[20:31] chfoo, I could not find any licensing info
[20:32] omf_: it's GPL v3
[20:34] so I looked for the Spark podcast in IA and it doesn't really exist there
[20:34] https://github.com/alard/CDX-Writer/compare/ignore-invalid-gzip-headers
[20:35] with 200+ episodes, I think I will slowly start mirroring it
[20:40] alard, wouldn't forking from https://github.com/rajbot/CDX-Writer be better, since the internetarchive one is an out-of-date fork of that?
[20:40] Then again, I do not know which version IA uses at present
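A matching sketch of the fix alard suggests at 19:17, keep everything until the last complete gzip/warc record: copy complete members to a new file and drop the truncated tail. Again a hypothetical illustration, not the missing script the channel is asking about, and the trimmed warc loses whatever was in the dropped tail (here, part of the wget log).

```python
import sys
import zlib

def truncate_warc_gz(src, dst):
    """Copy complete gzip members from src to dst, dropping the
    truncated or corrupt tail. Returns the number of bytes kept."""
    with open(src, 'rb') as f:
        data = f.read()
    offset = 0
    with open(dst, 'wb') as out:
        while offset < len(data):
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)
            try:
                d.decompress(data[offset:])
            except zlib.error:
                break                     # corrupt member: stop copying
            if not d.eof:
                break                     # truncated final member: drop it
            end = len(data) - len(d.unused_data)
            out.write(data[offset:end])   # this member was complete, keep it
            offset = end
    return offset

if __name__ == '__main__':
    print('kept %d bytes' % truncate_warc_gz(sys.argv[1], sys.argv[2]))
```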
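On the standalone cdx generator question: rajbot/CDX-Writer has no docs, but its cdx_writer.py appears to take a warc.gz path and print CDX lines on stdout. Treat that invocation as an assumption to be verified against the source; a local run might look like:

```python
import subprocess

# Assumed CLI: cdx_writer.py <warc.gz>, CDX lines on stdout.
# Verify against the repo before relying on this.
with open('rogerebert.com.cdx', 'wb') as out:
    subprocess.check_call(
        ['python', 'cdx_writer.py', 'rogerebert.com.warc.gz'],
        stdout=out,
    )
```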
[20:42] Oh my god, I want to punch the "right to forget" person in this blog
[21:18] so I have a little problem with the CBC Spark descriptions
[21:18] they have more than one line
[21:26] chfoo, alard: Anything major about the warc and cdx file formats missing from the wiki? I am trying to make it a big checklist so a developer can follow it and work with warcs
[21:33] SketchCow: which blog?
[21:34] first episode of Spark uploaded: https://archive.org/details/spark_20070905_3205
[21:50] Did anyone ever save Minitel? (if it was savable)
[21:50] I asked this a month ago and never saw the answer because I disconnected.
[21:57] Thanks for the additions, alard. Keep 'em rolling in :)
[22:02] I just listened to the CBC show. I agree, SketchCow, the right-to-forget proponent is a fool
[22:10] An international privacy expert who does not understand the web
[22:28] Okay, I've got 13 tools for dealing with warc files on here: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem . What tools are missing? What information would you like to see on there?
[22:29] Any other key metrics of the software we should be tracking? I've got license, language, testing, docs, # of authors
[22:32] so you guys will soon have all episodes of the CBC Spark podcast for 2007
[22:32] at least the ones I can find
[22:32] episodes 2 and 3 are gone, I guess
[22:41] err.. connection messed up again
[22:44] omf_: actually, I'd like to see an example that anyone could use to archive a site and make a WARC suitable for putting into the Wayback machine
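For the record, a sketch of what such an example might look like with wget 1.14's WARC support (the flags are real wget options; example.com is a placeholder, and per the discussion above you would still want to verify the warc.gz before uploading):

```python
import subprocess

# One-shot site grab that writes a Wayback-style WARC alongside the
# normal mirror. Equivalent to running wget directly from a shell.
subprocess.check_call([
    'wget',
    '--mirror',                  # recursive download with timestamping
    '--page-requisites',         # also fetch the CSS/JS/images pages need
    '--warc-file=example.com',   # write example.com.warc.gz
    '--warc-cdx',                # and a CDX index next to it
    '--wait=1',                  # be polite to the server
    'http://example.com/',
])
```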