#internetarchive.bak 2015-03-14,Sat

Time Nickname Message
00:15 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
00:16 🔗 svchfoo2 gives channel operator status to Start
00:29 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
00:40 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
01:08 🔗 closure sep332: looks ok.. is there a way to construct a URL from the fields there?
01:08 🔗 closure also, I can't tell what if anything you've done to handle commas in filenames..
01:09 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
01:53 🔗 zottelbey has quit (Ping timeout: 512 seconds)
02:06 🔗 sep332 the URL is just https://archive.org/download/[itemid]/[filename]
02:07 🔗 sep332 would just putting quotes around the filenames help?
02:08 🔗 Ctrl-S percent encode?
02:10 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
02:11 🔗 sep332 that would be nicer but I don't know how to do that yet
02:23 🔗 Ctrl-S wut language?
02:24 🔗 sep332 jq
02:25 🔗 sep332 http://stedolan.github.io/jq/manual/#Formatstringsandescaping
02:25 🔗 sep332 looks promising
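
The percent-encoding idea above can be sketched outside jq as well. This is a minimal Python illustration, not what anyone in the channel ran: it assumes the URL shape sep332 gave at 02:06, and the function name `download_url` and the sample item id and filename are made up for the example.

```python
from urllib.parse import quote


def download_url(item_id: str, filename: str) -> str:
    """Build a download URL with the item id and filename percent-encoded.

    Hypothetical helper; the URL layout is the one quoted in the log
    (https://archive.org/download/[itemid]/[filename]).
    """
    # safe="" encodes every reserved character in the item id.
    encoded_id = quote(item_id, safe="")
    # safe="/" leaves slashes alone, on the assumption that a filename
    # may contain a subdirectory path; commas and spaces are encoded.
    encoded_name = quote(filename, safe="/")
    return "https://archive.org/download/" + encoded_id + "/" + encoded_name


if __name__ == "__main__":
    # Made-up values: a filename containing a comma and spaces.
    print(download_url("example-item", "track 01, live.mp3"))
```

With the comma encoded as %2C, a comma-separated listing no longer needs quoting around the filename field.
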
02:25 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
02:29 🔗 marvinw has quit (Read error: Operation timed out)
02:37 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
02:38 🔗 phuzion has quit (Read error: Operation timed out)
02:41 🔗 marvinw (~marvinw@[redacted]) has joined #internetarchive.bak
02:46 🔗 johtso has quit (Quit: Connection closed for inactivity)
02:50 🔗 niyaje has quit (Ping timeout: 600 seconds)
03:21 🔗 phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
03:50 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
04:42 🔗 db48x has quit (Read error: Connection reset by peer)
04:44 🔗 espes___ I'm kinda impressed jq is a 'language'
05:22 🔗 SketchCow sep332: What are the conclusions? How much space in duplicates?
07:49 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
07:50 🔗 thunk has quit (Client Quit)
08:27 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
09:02 🔗 GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
09:10 🔗 serapeum (~serapeum@[redacted]) has joined #internetarchive.bak
09:47 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
09:47 🔗 mhazinsk has quit (hub.efnet.us irc.umich.edu)
09:47 🔗 trs80 has quit (hub.efnet.us irc.umich.edu)
09:52 🔗 bzc6p_ has quit (Ping timeout: 600 seconds)
09:53 🔗 Laverne has quit (Read error: Operation timed out)
09:53 🔗 chazchaz has quit (Read error: Operation timed out)
09:53 🔗 svchfoo1 has quit (Read error: Operation timed out)
09:53 🔗 aschmitz has quit (Read error: Operation timed out)
09:54 🔗 shabble has quit (Read error: Operation timed out)
09:54 🔗 shabble (~shabble@[redacted]) has joined #internetarchive.bak
09:55 🔗 aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
09:59 🔗 swebb has quit (Ping timeout: 369 seconds)
10:00 🔗 chazchaz (~chazchaz@[redacted]) has joined #internetarchive.bak
10:00 🔗 Laverne (~Laverne@[redacted]) has joined #internetarchive.bak
10:00 🔗 swebb (~swebb@[redacted]) has joined #internetarchive.bak
10:02 🔗 svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
10:02 🔗 phuzion has quit (Read error: Operation timed out)
10:02 🔗 sep332 has quit (Read error: Operation timed out)
10:02 🔗 achip has quit (Read error: Operation timed out)
10:02 🔗 svchfoo2 gives channel operator status to svchfoo1
10:03 🔗 londonca_ (~londoncal@[redacted]) has joined #internetarchive.bak
10:03 🔗 rossdylan has quit (Read error: Operation timed out)
10:03 🔗 dirt has quit (Read error: Operation timed out)
10:03 🔗 acridAxid has quit (Read error: Operation timed out)
10:04 🔗 acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
10:04 🔗 svchfoo1 gives channel operator status to acridAxid
10:04 🔗 GLaDOS has quit (Read error: Operation timed out)
10:05 🔗 GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
10:06 🔗 hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
10:08 🔗 londoncal has quit (Read error: Operation timed out)
10:09 🔗 hatsefla1 has quit (Read error: Operation timed out)
10:10 🔗 wp494 has quit (Quit: LOUD UNNECESSARY QUIT MESSAGES)
10:10 🔗 Ctrl-S has quit (Read error: Operation timed out)
10:10 🔗 bzc6p__ has quit (Read error: Operation timed out)
10:13 🔗 sep332 (~sep332@[redacted]) has joined #internetarchive.bak
10:13 🔗 dirt (james@[redacted]) has joined #internetarchive.bak
10:13 🔗 achip (~thechip@[redacted]) has joined #internetarchive.bak
10:13 🔗 svchfoo1 gives channel operator status to sep332
10:13 🔗 bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
10:15 🔗 phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
10:16 🔗 acridAxid has quit (Quit: Quitting)
10:17 🔗 Ctrl-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
10:17 🔗 svchfoo1 gives channel operator status to Ctrl-S
10:18 🔗 wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
10:18 🔗 acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
10:19 🔗 svchfoo2 gives channel operator status to acridAxid
10:19 🔗 rossdylan (~rossdylan@[redacted]) has joined #internetarchive.bak
10:36 🔗 Kenshin has quit (Ping timeout: 258 seconds)
10:37 🔗 irc.umich.edu gives channel operator status to trs80 mhazinsk
10:37 🔗 mhazinsk (~matt@[redacted]) has joined #internetarchive.bak
10:37 🔗 trs80 (trs80@[redacted]) has joined #internetarchive.bak
10:53 🔗 Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
10:54 🔗 svchfoo1 gives channel operator status to Kenshin
12:06 🔗 trs80 has quit (Ping timeout: 186 seconds)
12:08 🔗 trs80 (~trs80@[redacted]) has joined #internetarchive.bak
12:11 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
12:41 🔗 X-Scale has quit (Ping timeout: 240 seconds)
12:45 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
13:49 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
14:17 🔗 sep332 SketchCow: there are 659 TB of files listed. So more than half of that is duplicated.
14:57 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
15:20 🔗 destrudo has quit (Remote host closed the connection)
15:31 🔗 destrudo (~destrudo@[redacted]) has joined #internetarchive.bak
15:35 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
15:39 🔗 johtso (uid563@[redacted]) has joined #internetarchive.bak
16:10 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
17:00 🔗 garyrh has quit (http://bnc4free.com/)
17:20 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
18:04 🔗 londonca_ has quit (Quit: Leaving...)
18:05 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
19:02 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
19:49 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
20:54 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
20:55 🔗 garyrh_ (~Mithrandi@[redacted]) has joined #internetarchive.bak
21:30 🔗 DFJustin has quit (Ping timeout: 740 seconds)
21:42 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
21:43 🔗 closure <SketchCow> Anyway, tracey says use md5sum AND sha-1
21:44 🔗 closure still only md5sum in the current census
21:44 🔗 closure getting sha1s into it would be really helpful.
21:45 🔗 closure git-annex has only supported using md5 since february
22:00 🔗 closure jq is what the IA guys are using for this dataset. It's pretty rad!
22:00 🔗 closure zcat public-file-size-md_20150304205357.json.gz | ./jq '.files[] | .name, .size' |head
22:00 🔗 closure "Sabeeluna-al-jeehed.mp3"
22:00 🔗 closure 3227238
22:00 🔗 closure "Dilon_kay_Hukmaran.mp3"
22:00 🔗 closure 3737995
22:01 🔗 closure sep332: I think I'll go with this, makes it easy to parse and deal with any changes in the census
22:07 🔗 closure jq --raw-output '.collection , .id , (.files[] | .name, .size)'
22:11 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
22:11 🔗 closure jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , .size)'
22:14 🔗 closure jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , "s:" + (.size | tostring))'
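
The prefixed output of the last command (an unprefixed collection line, then "i:" for the item id, "f:" for each file name and "s:" for its size) could be read back roughly as follows. This is only a sketch under assumptions: the function name `parse_census_lines` and the sample values are hypothetical, each value is assumed to fit on one line, and a collection name that itself starts with "i:", "f:" or "s:" would be misread.

```python
def parse_census_lines(lines):
    """Group prefixed lines back into items (hypothetical helper).

    Expected order per item: collection line, "i:<id>", then repeated
    pairs of "f:<name>" and "s:<size>".
    """
    items = []
    current = None
    pending_collection = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("i:"):
            # A new item starts; attach the collection line seen just before it.
            current = {"collection": pending_collection, "id": line[2:], "files": []}
            items.append(current)
        elif line.startswith("f:") and current is not None:
            current["files"].append({"name": line[2:], "size": None})
        elif line.startswith("s:") and current is not None and current["files"]:
            # The size belongs to the most recently seen file.
            current["files"][-1]["size"] = int(line[2:])
        else:
            # Anything without a known prefix is treated as a collection line.
            pending_collection = line
    return items


if __name__ == "__main__":
    sample = [
        "example_collection",  # made-up values throughout
        "i:example-item",
        "f:a, b.mp3",
        "s:10",
        "f:c.mp3",
        "s:20",
    ]
    for item in parse_census_lines(sample):
        print(item)
```
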
22:18 🔗 bzc6p has quit (Read error: Operation timed out)
22:44 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
22:47 🔗 closure jq --raw-output '.id as $foo | .files[] | @uri "https://archive.org/\($foo)/\(.name)"' <-- this is pretty awesome, it's a list of urls!
22:52 🔗 db48x :)
22:56 🔗 closure kinda weird this 1st item in the census is 404 https://archive.org/details/Urdu-Trana-001
22:57 🔗 rossdylan has quit (Read error: Operation timed out)
23:20 🔗 closure argh, this dataset sometimes has id: "string" and sometimes the id is an array. makes it hard to get urls that always work
23:43 🔗 closure jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\($coll)\thttps://archive.org/\($id)/\(.name)"'
23:44 🔗 closure yay, that works for both arrays and not-arrays!
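
The string-or-array handling above can be illustrated in Python too. This is a rough analogue, not a translation of the jq command: the names `as_list` and `census_lines` and the sample record are hypothetical, the record is assumed to always have "collection", "id" and "files" keys, and the URL uses the download/ form sep332 gave at 02:06.

```python
from urllib.parse import quote


def as_list(value):
    """Return value unchanged if it is a list, else wrap it in a one-element list."""
    return value if isinstance(value, list) else [value]


def census_lines(record):
    """Yield one tab-separated line per (collection, id, file) combination.

    Hypothetical helper; loops in the same order as the jq pipeline:
    collections first, then ids, then files.
    """
    for coll in as_list(record["collection"]):
        for item_id in as_list(record["id"]):
            for entry in record["files"]:
                url = (
                    "https://archive.org/download/"
                    + quote(item_id, safe="")
                    + "/"
                    + quote(entry["name"], safe="/")
                )
                yield entry["md5"] + "\t" + quote(coll, safe="") + "\t" + url


if __name__ == "__main__":
    # Made-up record: "collection" is an array, "id" is a plain string.
    record = {
        "collection": ["coll-a", "coll-b"],
        "id": "example-item",
        "files": [{"name": "x y.mp3", "md5": "abc"}],
    }
    for line in census_lines(record):
        print(line)
```

Normalizing both fields to lists first means the rest of the code never has to care which shape a given record used.
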
23:45 🔗 closure jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\(.md5)\t\($coll)\thttps://archive.org/\($id)/\(.name)"'
23:46 🔗 closure is running that now on the full census to get a md5_collection_url.txt
23:46 🔗 closure already 3 million lines in the file, this is fast!
