[00:15] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[00:16] *** svchfoo2 gives channel operator status to Start
[00:29] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[00:40] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[01:08] sep332: looks ok.. is there a way to construct a URL from the fields there?
[01:08] also, I can't tell what if anything you've done to handle commas in filenames..
[01:09] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[01:53] *** zottelbey has quit (Ping timeout: 512 seconds)
[02:06] the URL is just https://archive.org/download/[itemid]/[filename]
[02:07] would just putting quotes around the filenames help?
[02:08] percent encode?
[02:10] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[02:11] that would be nicer but I don't know how to do that yet
[02:23] wut language?
[02:24] jq
[02:25] http://stedolan.github.io/jq/manual/#Formatstringsandescaping
[02:25] looks promising
[02:25] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[02:29] *** marvinw has quit (Read error: Operation timed out)
[02:37] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[02:38] *** phuzion has quit (Read error: Operation timed out)
[02:41] *** marvinw (~marvinw@[redacted]) has joined #internetarchive.bak
[02:46] *** johtso has quit (Quit: Connection closed for inactivity)
[02:50] *** niyaje has quit (Ping timeout: 600 seconds)
[03:21] *** phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
[03:50] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[04:42] *** db48x has quit (Read error: Connection reset by peer)
[04:44] I'm kinda impressed jq is a 'language'
[05:22] sep332: What are the conclusions? How much space in duplicates?
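The duplicate-space question above could be answered roughly as follows. This is only a sketch in Python, not what anyone in the channel ran: it assumes each census record has a `files` list whose entries carry `md5` and `size` (the fields used by the jq filters in this log), and it treats two files with the same md5 as the same content. The function name `duplicate_space` and the sample records are made up for illustration.

```python
def duplicate_space(records):
    """Return (total_bytes, duplicated_bytes) over all files in the records.

    The first file seen with a given md5 counts as the original;
    every later file with that md5 counts as duplicated space.
    """
    seen = set()       # md5 values already encountered
    total = 0
    duplicated = 0
    for record in records:
        for f in record.get("files", []):
            size = int(f["size"])
            total += size
            if f["md5"] in seen:
                duplicated += size
            else:
                seen.add(f["md5"])
    return total, duplicated


# Made-up sample data, not taken from the census.
sample = [
    {"id": "item-a", "files": [{"name": "a.mp3", "md5": "m1", "size": 10}]},
    {"id": "item-b", "files": [{"name": "b.mp3", "md5": "m1", "size": 10},
                               {"name": "c.mp3", "md5": "m2", "size": 5}]},
]
total, duplicated = duplicate_space(sample)
print(total, duplicated)
```

Keeping every md5 in an in-memory set is the simplest choice; for a census with a very large number of files, sorting the md5 list on disk first would probably be kinder to memory.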
[07:49] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[07:50] *** thunk has quit (Client Quit)
[08:27] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[09:02] *** GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
[09:10] *** serapeum (~serapeum@[redacted]) has joined #internetarchive.bak
[09:47] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[09:47] *** mhazinsk has quit (hub.efnet.us irc.umich.edu)
[09:47] *** trs80 has quit (hub.efnet.us irc.umich.edu)
[09:52] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[09:53] *** Laverne has quit (Read error: Operation timed out)
[09:53] *** chazchaz has quit (Read error: Operation timed out)
[09:53] *** svchfoo1 has quit (Read error: Operation timed out)
[09:53] *** aschmitz has quit (Read error: Operation timed out)
[09:54] *** shabble has quit (Read error: Operation timed out)
[09:54] *** shabble (~shabble@[redacted]) has joined #internetarchive.bak
[09:55] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[09:59] *** swebb has quit (Ping timeout: 369 seconds)
[10:00] *** chazchaz (~chazchaz@[redacted]) has joined #internetarchive.bak
[10:00] *** Laverne (~Laverne@[redacted]) has joined #internetarchive.bak
[10:00] *** swebb (~swebb@[redacted]) has joined #internetarchive.bak
[10:02] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[10:02] *** phuzion has quit (Read error: Operation timed out)
[10:02] *** sep332 has quit (Read error: Operation timed out)
[10:02] *** achip has quit (Read error: Operation timed out)
[10:02] *** svchfoo2 gives channel operator status to svchfoo1
[10:03] *** londonca_ (~londoncal@[redacted]) has joined #internetarchive.bak
[10:03] *** rossdylan has quit (Read error: Operation timed out)
[10:03] *** dirt has quit (Read error: Operation timed out)
[10:03] *** acridAxid has quit (Read error: Operation timed out)
[10:04] *** acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
[10:04] *** svchfoo1 gives channel operator status to acridAxid
[10:04] *** GLaDOS has quit (Read error: Operation timed out)
[10:05] *** GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
[10:06] *** hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
[10:08] *** londoncal has quit (Read error: Operation timed out)
[10:09] *** hatsefla1 has quit (Read error: Operation timed out)
[10:10] *** wp494 has quit (Quit: LOUD UNNECESSARY QUIT MESSAGES)
[10:10] *** Ctrl-S has quit (Read error: Operation timed out)
[10:10] *** bzc6p__ has quit (Read error: Operation timed out)
[10:13] *** sep332 (~sep332@[redacted]) has joined #internetarchive.bak
[10:13] *** dirt (james@[redacted]) has joined #internetarchive.bak
[10:13] *** achip (~thechip@[redacted]) has joined #internetarchive.bak
[10:13] *** svchfoo1 gives channel operator status to sep332
[10:13] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
[10:15] *** phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
[10:16] *** acridAxid has quit (Quit: Quitting)
[10:17] *** Ctrl-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
[10:17] *** svchfoo1 gives channel operator status to Ctrl-S
[10:18] *** wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
[10:18] *** acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
[10:19] *** svchfoo2 gives channel operator status to acridAxid
[10:19] *** rossdylan (~rossdylan@[redacted]) has joined #internetarchive.bak
[10:36] *** Kenshin has quit (Ping timeout: 258 seconds)
[10:37] *** irc.umich.edu gives channel operator status to trs80 mhazinsk
[10:37] *** mhazinsk (~matt@[redacted]) has joined #internetarchive.bak
[10:37] *** trs80 (trs80@[redacted]) has joined #internetarchive.bak
[10:53] *** Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
[10:54] *** svchfoo1 gives channel operator status to Kenshin
[12:06] *** trs80 has quit (Ping timeout: 186 seconds)
[12:08] *** trs80 (~trs80@[redacted]) has joined #internetarchive.bak
[12:11] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[12:41] *** X-Scale has quit (Ping timeout: 240 seconds)
[12:45] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[13:49] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[14:17] SketchCow: there are 659 TB of files listed. So more than half of that is duplicated.
[14:57] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[15:20] *** destrudo has quit (Remote host closed the connection)
[15:31] *** destrudo (~destrudo@[redacted]) has joined #internetarchive.bak
[15:35] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[15:39] *** johtso (uid563@[redacted]) has joined #internetarchive.bak
[16:10] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[17:00] *** garyrh has quit (http://bnc4free.com/)
[17:20] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[18:04] *** londonca_ has quit (Quit: Leaving...)
[18:05] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[19:02] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[19:49] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[20:54] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[20:55] *** garyrh_ (~Mithrandi@[redacted]) has joined #internetarchive.bak
[21:30] *** DFJustin has quit (Ping timeout: 740 seconds)
[21:42] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[21:43] Anyway, tracey says use md5sum AND sha-1
[21:44] still only md5sum in the current census
[21:44] getting sha1s into it would be really helpful.
[21:45] git-annex has only supported using md5 since February
[22:00] jq is what the IA guys are using for this dataset. It's pretty rad!
[22:00] zcat public-file-size-md_20150304205357.json.gz | ./jq '.files[] | .name, .size' |head
[22:00] "Sabeeluna-al-jeehed.mp3"
[22:00] 3227238
[22:00] "Dilon_kay_Hukmaran.mp3"
[22:00] 3737995
[22:01] sep332: I think I'll go with this, makes it easy to parse and deal with any changes in the census
[22:07] jq --raw-output '.collection , .id , (.files[] | .name, .size)'
[22:11] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[22:11] jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , .size)'
[22:14] jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , "s:" + (.size | tostring))'
[22:18] *** bzc6p has quit (Read error: Operation timed out)
[22:44] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[22:47] jq --raw-output '.id as $foo | .files[] | @uri "https://archive.org/\($foo)/\(.name)"' <-- this is pretty awesome, it's a list of urls!
[22:52] :)
[22:56] kinda weird this 1st item in the census is 404 https://archive.org/details/Urdu-Trana-001
[22:57] *** rossdylan has quit (Read error: Operation timed out)
[23:20] argh, this dataset sometimes has id: "string" and sometimes the id is an array. makes it hard to get urls that always work
[23:43] jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\($coll)\thttps://archive.org/\($id)/\(.name)"'
[23:44] yay, that works for both arrays and not-arrays!
[23:45] jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\(.md5)\t\($coll)\thttps://archive.org/\($id)/\(.name)"'
[23:46] *** closure is running that now on the full census to get an md5_collection_url.txt
[23:46] already 3 million lines in the file, this is fast!
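For readers who don't use jq, the last filter can be approximated in Python. This is a hedged sketch, not the command that was run: `as_list`, `enc` and `census_lines` are hypothetical names, the sample record is made up, and the sketch assumes every record has `collection`, `id` and `files` with `name` and `md5`. It also uses the `https://archive.org/download/[itemid]/[filename]` form quoted earlier in the log, whereas the jq filters above leave out the `download/` segment.

```python
from urllib.parse import quote


def as_list(value):
    """Wrap a scalar in a list; pass lists through unchanged.

    Plays the role of the `.id[]? // .id` idiom: the census sometimes
    stores a plain string and sometimes an array.
    """
    return value if isinstance(value, list) else [value]


def enc(value):
    """Percent-encode all but unreserved characters, roughly like jq's @uri."""
    return quote(str(value), safe="")


def census_lines(record):
    """Yield one 'md5<TAB>collection<TAB>url' line per (collection, id, file)."""
    for coll in as_list(record["collection"]):
        for item_id in as_list(record["id"]):
            for f in record["files"]:
                # URL form assumed from the 02:06 message in this log.
                url = "https://archive.org/download/{}/{}".format(
                    enc(item_id), enc(f["name"])
                )
                yield "\t".join([enc(f["md5"]), enc(coll), url])


# Made-up record; the collection names and md5 are placeholders.
record = {
    "collection": ["collection-one", "collection-two"],
    "id": "Example-Item",
    "files": [{"name": "my file, v2.mp3", "md5": "0123abcd", "size": 3227238}],
}
for line in census_lines(record):
    print(line)
```

Passing `safe=""` means a `/` inside a filename is encoded too, which matches what `@uri` does to interpolated values; whether archive.org accepts an encoded slash in a download path is not something this log settles.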