#archiveteam 2013-04-12,Fri


Time Nickname Message
02:48 🔗 omf_ frontaalnaakt.nl is almost done uploading. Another site saved from religion
07:11 🔗 PepsiMax omf_: as a Dutchie: risky click.
07:17 🔗 newbie13 damn the DB zip files still dont work :(
07:43 🔗 Nemo_bis Hm, this is not working that well, is it? http://dsss.be/newegg-hard-drive-cost/
10:58 🔗 SketchCow http://www.edwardbetts.com/price_per_tb/ is what I use
11:10 🔗 godane so i got most of the web only towel talk from techtv
11:10 🔗 godane thanks to this: http://web.archive.org/web/20030210160905/http://www.techtv.com/screensavers/aboutus/story/0,24330,3402140,00.html
11:11 🔗 godane and yes they were interviews with a towel
11:11 🔗 godane but bad news is the patrick norton interview has no audio from what i can tell
15:29 🔗 SketchCow FOS is going read-only.
15:29 🔗 SketchCow We're getting a new machine!
15:30 🔗 SketchCow Now, what to name it
15:32 🔗 SketchCow Nailed
15:32 🔗 SketchCow Honeycomb Hideout
15:48 🔗 Smiley D:
15:48 🔗 Smiley "FUCK YOU".
15:49 🔗 Smiley Fata than foo.
16:25 🔗 godane SketchCow: so FOS has almost died?
16:26 🔗 godane must be trying to mirror everything to archive.org as fast as possible then
18:02 🔗 Smiley godane: it's not that, it's drive failures or possibly faulty hardware faking drive failures.
18:15 🔗 godane ok
18:15 🔗 godane but still
18:15 🔗 godane mirror it to IA
18:36 🔗 SketchCow omf_: Your grabs of sites are not working, and are not deriving.
18:36 🔗 SketchCow glitch.com, rogerebert.com and gamasutra.com have not worked.
18:37 🔗 Smiley 1D:
18:37 🔗 Smiley wtf
18:39 🔗 omf_ All I did was use wget 1.14 and those sites probably have link cancer in them
18:40 🔗 SketchCow http://www-tracey.us.archive.org/log_show.php?task_id=153967762
18:42 🔗 DFJustin that link isn't public, try https://www.us.archive.org/log_show.php?task_id=153967762
18:42 🔗 omf_ It worked for me, probably because I am logged in
18:45 🔗 omf_ Is the derive code online? I checked https://github.com/internetarchive/ but couldn't find a project for it
18:49 🔗 omf_ Is this https://github.com/internetarchive/CDX-Writer up to date?
18:50 🔗 DFJustin https://github.com/rajbot/CDX-Writer looks newer
18:51 🔗 Smiley Adding to fail-reasons list: CDXIndex:gzip fail:gamasutra.warc.gz ...
19:02 🔗 omf_ Okay so I am trying to run cdx_writer.py from https://github.com/rajbot/CDX-Writer to see if I can get some more information locally.
19:02 🔗 omf_ The problem is there are no docs for how to do this. So poking around I find I need this dependency https://bitbucket.org/rajbot/warc-tools/overview but when I git clone it, there is a server error
19:02 🔗 Smiley zlib.error: Error -3 while decompressing: incorrect header check :/ what does that even mean :<
19:02 🔗 omf_ I know what that means
19:03 🔗 omf_ warc.gz are a collection of warc records that are gz compressed
19:03 🔗 omf_ now gz files can have multiple separate entries
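[Editor's note: omf_'s point is easy to demonstrate with the standard library — a `.warc.gz` is just independent gzip members concatenated into one file, one member per WARC record, and readers see them as a single stream:]

```python
import gzip
import io

# Two separate gzip members, concatenated -- the same layout as a
# .warc.gz (one compressed member per WARC record).
blob = gzip.compress(b'first record\n') + gzip.compress(b'second record\n')

# GzipFile reads concatenated members transparently, as one stream.
with gzip.open(io.BytesIO(blob), 'rb') as f:
    data = f.read()

print(data)  # b'first record\nsecond record\n'
```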
19:04 🔗 Smiley so do they need splitting up or something to fix?
19:04 🔗 omf_ the last entry (I assume since I cannot get the tool running yet) is truncated and thus throwing the error. What I wonder is why there is no recovery for an issue like this when looking at the test suite shows there was some serious effort put in
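[Editor's note: omf_'s diagnosis can be checked with a small stdlib-only sketch. The function name is mine; it walks the gzip members one by one and reports where the complete data ends, which is exactly the situation behind the `zlib.error: Error -3` and `CDXIndex:gzip fail` messages above:]

```python
import zlib

def scan_warc_gz(path):
    """Walk the concatenated gzip members of a .warc.gz.

    Returns (good_end, broken): good_end is the byte offset just past
    the last *complete* member; broken is True when a trailing member
    is truncated or corrupt. (Reads the whole file into memory for
    simplicity; stream it for multi-GB warcs.)
    """
    with open(path, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
        try:
            d.decompress(data[pos:])
        except zlib.error:
            return pos, True      # corrupt member starts at this offset
        if not d.eof:
            return pos, True      # input ran out mid-member: truncated
        pos = len(data) - len(d.unused_data)  # jump past the member
    return pos, False
```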
19:07 🔗 alard I get gzip: rogerebert.com.warc.gz: decompression OK, trailing garbage ignored , so I guess there's something missing at the end.
19:07 🔗 omf_ The specification for WARC itself has no mention of handling corrupt records, recovery, or anything dealing with broken files. {sigh}
19:07 🔗 omf_ alard, That is my take away as well
19:08 🔗 alard Should that be included in the specification?
19:10 🔗 omf_ Well if you look at the gzip spec they have language about checking for errors in compliance tests as well as data verification
19:10 🔗 omf_ How tools should handle errors
19:11 🔗 alard Is that rogerebert.com.warc.gz one warc or was it stitched together?
19:11 🔗 omf_ None of them were stitched together. I just ran wget and uploaded them when wget finished
19:12 🔗 alard The log record at the end is missing in my uncompressed rogerebert.warc.
19:13 🔗 omf_ So how do you fix that? We have an existing tool
19:17 🔗 alard In this case you could keep everything until the last gzip/warc record.
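[Editor's note: alard's fix — keep everything up to the last complete gzip/warc record — can be sketched as an in-place truncation. The helper name is mine; it is destructive, so work on a copy:]

```python
import zlib

def keep_complete_records(path):
    """Truncate a .warc.gz just past its last complete gzip member,
    dropping the unfinished trailing record. Destructive: back the
    file up first. Returns the number of bytes dropped."""
    with open(path, 'rb') as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            d.decompress(data[pos:])
        except zlib.error:
            break                 # corrupt member: cut here
        if not d.eof:
            break                 # truncated member: cut here
        pos = len(data) - len(d.unused_data)
    dropped = len(data) - pos
    if dropped:
        with open(path, 'r+b') as f:
            f.truncate(pos)
    return dropped
```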
19:17 🔗 omf_ yep
19:18 🔗 omf_ I am kinda surprised this has not come up before
19:19 🔗 alard It has. We have unfinished warcs. The megawarc builder checks for this and puts those warcs in the tar file.
19:20 🔗 omf_ But we don't have a way to fix this?
19:20 🔗 alard But you shouldn't be doing it with every warc. That's strange.
19:20 🔗 omf_ It was 3 out of like 20 so far
19:20 🔗 alard No. No script.
19:20 🔗 omf_ and I got a few hundred more to upload
19:21 🔗 alard This shouldn't happen if Wget works normally and doesn't exit halfway.
19:22 🔗 omf_ I agree
19:24 🔗 alard Is there a standalone, easy to run cdx generator?
19:26 🔗 omf_ That is what I am looking for. I am thinking about opening some bug reports, see if I can help fix shit up
19:27 🔗 omf_ I should have a way to check and fix warcs before they are even uploaded
19:29 🔗 alard All warcs generated with Wgets older than the very very latest Wget-git (or Wget+Lua) are somewhat broken.
19:29 🔗 alard It's just that most tools don't see it.
19:30 🔗 omf_ chfoo, Mentioned that as well
19:39 🔗 alard This is from the header of the last record in the rogerebert warc:
19:39 🔗 alard 00000340 4c 4f da 02 00 00 58 58 58 58 58 58 58 58 58 58 |LO....XXXXXXXXXX|
19:39 🔗 alard 00000350 58 58 1f 8b 08 00 00 00 00 00 02 03 d4 bd eb 72 |XX.............r|
19:39 🔗 alard 00000360 1d 47 92 26 f8 5f 66 fd 0e e7 4f cb 34 66 cb 8c |.G.&._f...O.4f..|
19:39 🔗 alard The X's are placeholders that Wget fills up after writing the whole record, so apparently it never got that far.
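[Editor's note: alard's hexdump suggests a crude check — twelve ASCII `X` bytes immediately followed by a gzip magic (`1f 8b`) would mean a placeholder Wget never went back to fill in. This is a heuristic I am inferring from the dump above, not a documented Wget invariant, and a record payload could legitimately contain those bytes:]

```python
def find_placeholder_runs(path):
    """Heuristic: report offsets of b'XXXXXXXXXXXX' runs that sit
    directly before a gzip magic, as in alard's hexdump -- a sign of
    a record header Wget never finished filling in. May false-positive
    on payloads that happen to contain this byte sequence."""
    needle = b'X' * 12 + b'\x1f\x8b'
    with open(path, 'rb') as f:
        data = f.read()
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits
```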
19:42 🔗 omf_ I am going to put all my warc information online
19:42 🔗 omf_ Is the wiki the best place?
19:42 🔗 alard Yes, I think so.
19:43 🔗 alard This gives you that final record: tail -c +1161896909 rogerebert.com.warc.gz | gunzip -c | less It's the Wget log, but incomplete. You should have gotten an error. Disk full?
19:44 🔗 omf_ no idea
19:53 🔗 omf_ Here is the bulk of my warc information - http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
19:54 🔗 omf_ I am going to add a file format section once I finish typing it up
20:05 🔗 chfoo what did i mention? wget makes duplicate record ids?
20:05 🔗 omf_ yes
20:07 🔗 chfoo while you're at it, can you add my warc tool to the wiki as well :)
20:07 🔗 omf_ I just added a WARC format section
20:08 🔗 omf_ no problem chazchaz
20:08 🔗 omf_ I mean chfoo
20:10 🔗 omf_ I know there are some details missing, feel free to add them in. I want this page to be the only thing someone has to read to master warc files
20:13 🔗 alard chfoo: Duplicate record IDs? That's new for me.
20:15 🔗 alard They're supposedly unique UUIDs.
20:16 🔗 chfoo it generates two resources records for MANIFEST.txt and wget_arguments.txt but the id is the same
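[Editor's note: for context on alard's point — `WARC-Record-ID` values are URIs, customarily of the form `<urn:uuid:...>`, and each record is expected to get a fresh one. A minimal sketch of correct ID generation (helper name mine):]

```python
import uuid

def new_record_id():
    """Return a WARC-Record-ID in the customary <urn:uuid:...> form.
    Each call yields a fresh UUID; the wget bug chfoo describes (two
    resource records sharing one ID) breaks this uniqueness."""
    return '<urn:uuid:%s>' % uuid.uuid4()

print(new_record_id())
```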
20:22 🔗 SketchCow ONE TWO ARCHIVE TEAM MEMBER APPEARANCES ON CBC SHOW: http://www.cbc.ca/spark/episodes/2013/04/12/213-data-longevity-integrative-thinking-virtual-staging/
20:22 🔗 SketchCow Take that, world
20:27 🔗 godane just passed 29k videos for g4video-web collection
20:31 🔗 omf_ chfoo, I could not find any licensing info
20:32 🔗 chfoo omf_: it's GPL v3
20:34 🔗 godane so i look for spark podcast in IA and it doesn't really exist there
20:34 🔗 alard https://github.com/alard/CDX-Writer/compare/ignore-invalid-gzip-headers
20:35 🔗 godane with over 200+ episodes i think i will slowly start mirroring that
20:40 🔗 omf_ alard, Wouldn't forking from https://github.com/rajbot/CDX-Writer be better since the internetarchive one is a out of date fork of that?
20:40 🔗 omf_ Then again I do not know which version is used by IA at present
20:42 🔗 SketchCow Oh my god, I want to punch the "right to forget" person in this blog
21:18 🔗 godane so i have a little problem with cbc spark descs
21:18 🔗 godane it has more than one line
21:26 🔗 omf_ chfoo, alard Anything major about the warc and cdx file formats missing from the wiki? I am trying to make it a big checklist so a developer can follow it and work with warcs
21:33 🔗 eadler SketchCow: which blog ?
21:34 🔗 godane first episode of spark uploaded: https://archive.org/details/spark_20070905_3205
21:50 🔗 arkhive Did anyone ever save Minitel? (if it was savable)
21:50 🔗 arkhive I asked this a month ago and never saw the answer because i disconnected.
21:57 🔗 omf_ Thanks for the additions alard keep em rolling in :)
22:02 🔗 omf_ I just listened to cbc show. I agree SketchCow the right to forget proponent is a fool
22:10 🔗 omf_ An international privacy expert who does not understand the web
22:28 🔗 omf_ Okay I got 13 tools for dealing with warc files on here http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem . What tools are missing? What information would you like to see on there?
22:29 🔗 omf_ Any other key metrics of the software we should be tracking? I got license, language, testing, docs, # of authors
22:32 🔗 godane so you guys will soon have all episodes of cbc spark podcast for 2007
22:32 🔗 godane at least the ones i can find
22:32 🔗 godane episode 2 and 3 are gone i guess
22:41 🔗 arkhive err.. connection messed up again
22:44 🔗 dashcloud omf_: actually, I'd like to see an example that anyone could use to archive a site and make a WARC suitable for putting into the Wayback machine
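[Editor's note: one plausible answer to dashcloud's request, using the WARC options that exist in wget 1.14 (the version omf_ mentions above). This only assembles the command; `example.com` is a placeholder site:]

```python
# Build (but do not run) a wget invocation that crawls a site and
# writes example.com.warc.gz plus a CDX index -- a WARC suitable for
# Wayback ingestion, assuming wget 1.14 or newer.
cmd = [
    'wget',
    '--mirror',                  # recurse through the site
    '--page-requisites',         # grab images/CSS/JS the pages need
    '--warc-file=example.com',   # write example.com.warc.gz
    '--warc-cdx',                # also emit a CDX index file
    'http://example.com/',
]
print(' '.join(cmd))
```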
