#archiveteam 2013-04-12,Fri


Time Nickname Message
02:48 🔗 omf_ frontaalnaakt.nl is almost done uploading. Another site saved from religion
07:11 🔗 PepsiMax omf_: as a Dutchie: risky click.
07:17 🔗 newbie13 damn the DB zip files still dont work :(
07:43 🔗 Nemo_bis Hm, this is not working that well, is it? http://dsss.be/newegg-hard-drive-cost/
10:58 🔗 SketchCow http://www.edwardbetts.com/price_per_tb/ is what I use
11:10 🔗 godane so i got most of the web only towel talk from techtv
11:10 🔗 godane thanks to this: http://web.archive.org/web/20030210160905/http://www.techtv.com/screensavers/aboutus/story/0,24330,3402140,00.html
11:11 🔗 godane and yes they were interviews with a towel
11:11 🔗 godane but bad news is the patrick norton interview has no audio from what i can tell
15:29 🔗 SketchCow FOS is going read-only.
15:29 🔗 SketchCow We're getting a new machine!
15:30 🔗 SketchCow Now, what to name it
15:32 🔗 SketchCow Nailed
15:32 🔗 SketchCow Honeycomb Hideout
15:48 🔗 Smiley D:
15:48 🔗 Smiley "FUCK YOU".
15:49 🔗 Smiley Fata than foo.
16:25 🔗 godane SketchCow: so FOS has almost died?
16:26 🔗 godane must be trying to mirror everything to archive.org as fast as possible then
18:02 🔗 Smiley godane: it's not that, it's drive failures or possibly faulty hardware faking drive failures.
18:15 🔗 godane ok
18:15 🔗 godane but still
18:15 🔗 godane mirror it to IA
18:36 🔗 SketchCow omf_: Your grabs of sites are not working, and are not deriving.
18:36 🔗 SketchCow glitch.com, rogerebert.com and gamasutra.com have not worked.
18:37 🔗 Smiley 1D:
18:37 🔗 Smiley wtf
18:39 🔗 omf_ All I did was use wget 1.14 and those sites probably have link cancer in them
18:40 🔗 SketchCow http://www-tracey.us.archive.org/log_show.php?task_id=153967762
18:42 🔗 DFJustin that link isn't public, try https://www.us.archive.org/log_show.php?task_id=153967762
18:42 🔗 omf_ It worked for me, probably because I am logged in
18:45 🔗 omf_ Is the derive code online? I checked https://github.com/internetarchive/ but couldn't find a project for it
18:49 🔗 omf_ Is this https://github.com/internetarchive/CDX-Writer up to date?
18:50 🔗 DFJustin https://github.com/rajbot/CDX-Writer looks newer
18:51 🔗 Smiley Adding to fail-reasons list: CDXIndex:gzip fail:gamasutra.warc.gz ...
19:02 🔗 omf_ Okay so I am trying to run cdx_writer.py from https://github.com/rajbot/CDX-Writer to see if I can get some more information locally.
19:02 🔗 omf_ The problem is there are no docs for how to do this. So poking around I find I need this dependency https://bitbucket.org/rajbot/warc-tools/overview but when I git clone it, there is a server error
19:02 🔗 Smiley zlib.error: Error -3 while decompressing: incorrect header check :/ what does that even mean :<
19:02 🔗 omf_ I know what that means
19:03 🔗 omf_ warc.gz are a collection of warc records that are gz compressed
19:03 🔗 omf_ now gz files can have multiple separate entries
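[Editor's note: omf_'s point is easy to demonstrate with the standard library — a `.warc.gz` is just independent gzip members concatenated into one file, one member per WARC record, and readers see them as a single stream:]

```python
import gzip
import io

# Two separate gzip members, concatenated -- the same layout as a
# .warc.gz (one compressed member per WARC record).
blob = gzip.compress(b'first record\n') + gzip.compress(b'second record\n')

# GzipFile reads concatenated members transparently, as one stream.
with gzip.open(io.BytesIO(blob), 'rb') as f:
    data = f.read()

print(data)  # b'first record\nsecond record\n'
```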
19:04 🔗 Smiley so do they need splitting up or something to fix?
19:04 🔗 omf_ the last entry (I assume since I cannot get the tool running yet) is truncated and thus throwing the error. What I wonder is why there is no recovery for an issue like this when looking at the test suite shows there was some serious effort put in
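[Editor's note: omf_'s diagnosis can be checked with a small stdlib-only sketch. The function name is mine; it walks the gzip members one by one and reports where the complete data ends, which is exactly the situation behind the `zlib.error: Error -3` and `CDXIndex:gzip fail` messages above:]

```python
import zlib

def scan_warc_gz(path):
    """Walk the concatenated gzip members of a .warc.gz.

    Returns (good_end, broken): good_end is the byte offset just past
    the last *complete* member; broken is True when a trailing member
    is truncated or corrupt. (Reads the whole file into memory for
    simplicity; stream it for multi-GB warcs.)
    """
    with open(path, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
        try:
            d.decompress(data[pos:])
        except zlib.error:
            return pos, True      # corrupt member starts at this offset
        if not d.eof:
            return pos, True      # input ran out mid-member: truncated
        pos = len(data) - len(d.unused_data)  # jump past the member
    return pos, False
```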
19:07 🔗 alard I get gzip: rogerebert.com.warc.gz: decompression OK, trailing garbage ignored , so I guess there's something missing at the end.
19:07 🔗 omf_ The specification for WARC itself has no mention of handling corrupt records, recovery, or anything dealing with broken files. {sigh}
19:07 🔗 omf_ alard, That is my take away as well
19:08 🔗 alard Should that be included in the specification?
19:10 🔗 omf_ Well if you look at the gzip spec they have language about checking for errors in compliance tests as well as data verification
19:10 🔗 omf_ How tools should handle errors
19:11 🔗 alard Is that rogerebert.com.warc.gz one warc or was it stitched together?
19:11 🔗 omf_ None of them were stitched together. I just ran wget and uploaded them when wget finished
19:12 🔗 alard The log record at the end is missing in my uncompressed rogerebert.warc.
19:13 🔗 omf_ So how do you fix that? We have an existing tool
19:17 🔗 alard In this case you could keep everything until the last gzip/warc record.
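[Editor's note: alard's fix — keep everything up to the last complete gzip/warc record — can be sketched as an in-place truncation. The helper name is mine; it is destructive, so work on a copy:]

```python
import zlib

def keep_complete_records(path):
    """Truncate a .warc.gz just past its last complete gzip member,
    dropping the unfinished trailing record. Destructive: back the
    file up first. Returns the number of bytes dropped."""
    with open(path, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            d.decompress(data[pos:])
        except zlib.error:
            break                 # corrupt member: cut here
        if not d.eof:
            break                 # truncated member: cut here
        pos = len(data) - len(d.unused_data)
    dropped = len(data) - pos
    if dropped:
        with open(path, 'r+b') as f:
            f.truncate(pos)
    return dropped
```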
19:17 🔗 omf_ yep
19:18 🔗 omf_ I am kinda surprised this has not come up before
19:19 🔗 alard It has. We have unfinished warcs. The megawarc builder checks for this and puts those warcs in the tar file.
19:20 🔗 omf_ But we don't have a way to fix this?
19:20 🔗 alard But you shouldn't be doing it with every warc. That's strange.
19:20 🔗 omf_ It was 3 out of like 20 so far
19:20 🔗 alard No. No script.
19:20 🔗 omf_ and I got a few hundred more to upload
19:21 🔗 alard This shouldn't happen if Wget works normally and doesn't exit halfway.
19:22 🔗 omf_ I agree
19:24 🔗 alard Is there a standalone, easy to run cdx generator?
19:26 🔗 omf_ That is what I am looking for. I am thinking about opening some bug reports, see if I can help fix shit up
19:27 🔗 omf_ I should have a way to check and fix warcs before they are even uploaded
19:29 🔗 alard All warcs generated with Wgets older than the very very latest Wget-git (or Wget+Lua) are somewhat broken.
19:29 🔗 alard It's just that most tools don't see it.
19:30 🔗 omf_ chfoo, Mentioned that as well
19:39 🔗 alard This is from the header of the last record in the rogerebert warc:
19:39 🔗 alard 00000340 4c 4f da 02 00 00 58 58 58 58 58 58 58 58 58 58 |LO....XXXXXXXXXX|
19:39 🔗 alard 00000350 58 58 1f 8b 08 00 00 00 00 00 02 03 d4 bd eb 72 |XX.............r|
19:39 🔗 alard 00000360 1d 47 92 26 f8 5f 66 fd 0e e7 4f cb 34 66 cb 8c |.G.&._f...O.4f..|
19:39 🔗 alard The X's are placeholders that Wget fills up after writing the whole record, so apparently it never got that far.
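[Editor's note: alard's hexdump suggests a crude check — twelve ASCII `X` bytes immediately followed by a gzip magic (`1f 8b`) would mean a placeholder Wget never went back to fill in. This is a heuristic I am inferring from the dump above, not a documented Wget invariant, and a record payload could legitimately contain those bytes:]

```python
def find_placeholder_runs(path):
    """Heuristic: report offsets of b'XXXXXXXXXXXX' runs that sit
    directly before a gzip magic, as in alard's hexdump -- a sign of
    a record header Wget never finished filling in. May false-positive
    on payloads that happen to contain this byte sequence."""
    needle = b'X' * 12 + b'\x1f\x8b'
    with open(path, 'rb') as f:
        data = f.read()
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits
```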
19:42 🔗 omf_ I am going to put all my warc information online
19:42 🔗 omf_ Is the wiki the best place?
19:42 🔗 alard Yes, I think so.
19:43 🔗 alard This gives you that final record: tail -c +1161896909 rogerebert.com.warc.gz | gunzip -c | less It's the Wget log, but incomplete. You should have gotten an error. Disk full?
19:44 🔗 omf_ no idea
19:53 🔗 omf_ Here is the bulk of my warc information - http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
19:54 🔗 omf_ I am going to add a file format section once I finish typing it up
20:05 🔗 chfoo what did i mention? wget makes duplicate record ids?
20:05 🔗 omf_ yes
20:07 🔗 chfoo while you're at it, can you add my warc tool to the wiki as well :)
20:07 🔗 omf_ I just added a WARC format section
20:08 🔗 omf_ no problem chazchaz
20:08 🔗 omf_ I mean chfoo
20:10 🔗 omf_ I know there are some details missing, feel free to add them in. I want this page to be the only thing someone has to read to master warc files
20:13 🔗 alard chfoo: Duplicate record IDs? That's new for me.
20:15 🔗 alard They're supposedly unique UUIDs.
20:16 🔗 chfoo it generates two resources records for MANIFEST.txt and wget_arguments.txt but the id is the same
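[Editor's note: for context on alard's point — `WARC-Record-ID` values are URIs, customarily of the form `<urn:uuid:...>`, and each record is expected to get a fresh one. A minimal sketch of correct ID generation (helper name mine):]

```python
import uuid

def new_record_id():
    """Return a WARC-Record-ID in the customary <urn:uuid:...> form.
    Each call yields a fresh UUID; the wget bug chfoo describes (two
    resource records sharing one ID) breaks this uniqueness."""
    return '<urn:uuid:%s>' % uuid.uuid4()

print(new_record_id())
```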
20:22 🔗 SketchCow ONE TWO ARCHIVE TEAM MEMBER APPEARANCES ON CBC SHOW: http://www.cbc.ca/spark/episodes/2013/04/12/213-data-longevity-integrative-thinking-virtual-staging/
20:22 🔗 SketchCow Take that, world
20:27 🔗 godane just passed 29k videos for g4video-web collection
20:31 🔗 omf_ chfoo, I could not find any licensing info
20:32 🔗 chfoo omf_: it's GPL v3
20:34 🔗 godane so i look for spark podcast in IA and it doesn't really exist there
20:34 🔗 alard https://github.com/alard/CDX-Writer/compare/ignore-invalid-gzip-headers
20:35 🔗 godane with over 200+ episodes i think i will slowly start mirroring that
20:40 🔗 omf_ alard, Wouldn't forking from https://github.com/rajbot/CDX-Writer be better since the internetarchive one is a out of date fork of that?
20:40 🔗 omf_ Then again I do not know which version is used by IA at present
20:42 🔗 SketchCow Oh my god, I want to punch the "right to forget" person in this blog
21:18 🔗 godane so i have a little problem with cbc spark descs
21:18 🔗 godane it has more than one line
21:26 🔗 omf_ chfoo, alard Anything major about the warc and cdx file formats missing from the wiki? I am trying to make it a big checklist so a developer can follow it and work with warcs
21:33 🔗 eadler SketchCow: which blog ?
21:34 🔗 godane first episode of spark uploaded: https://archive.org/details/spark_20070905_3205
21:50 🔗 arkhive Did anyone ever save Minitel? (if it was savable)
21:50 🔗 arkhive I asked this a month ago and never saw the answer because i disconnected.
21:57 🔗 omf_ Thanks for the additions alard keep em rolling in :)
22:02 🔗 omf_ I just listened to cbc show. I agree SketchCow the right to forget proponent is a fool
22:10 🔗 omf_ An international privacy expert who does not understand the web
22:28 🔗 omf_ Okay I got 13 tools for dealing with warc files on here http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem . What tools are missing? What information would you like to see on there?
22:29 🔗 omf_ Any other key metrics of the software we should be tracking? I got license, language, testing, docs, # of authors
22:32 🔗 godane so you guys will soon have all episodes of cbc spark podcast for 2007
22:32 🔗 godane at least the ones i can find
22:32 🔗 godane episode 2 and 3 are gone i guess
22:41 🔗 arkhive err.. connection messed up again
22:44 🔗 dashcloud omf_: actually, I'd like to see an example that anyone could use to archive a site and make a WARC suitable for putting into the Wayback machine
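[Editor's note: one plausible answer to dashcloud's request, using the WARC options that exist in wget 1.14 (the version omf_ mentions above). This only assembles the command; `example.com` is a placeholder site:]

```python
# Build (but do not run) a wget invocation that crawls a site and
# writes example.com.warc.gz plus a CDX index -- a WARC suitable for
# Wayback ingestion, assuming wget 1.14 or newer.
cmd = [
    'wget',
    '--mirror',                  # recurse through the site
    '--page-requisites',         # grab images/CSS/JS the pages need
    '--warc-file=example.com',   # write example.com.warc.gz
    '--warc-cdx',                # also emit a CDX index file
    'http://example.com/',
]
print(' '.join(cmd))
```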
