[00:06] *** londoncal has quit (Remote host closed the connection)
[00:31] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[00:37] *** Start has quit (Disconnected.)
[00:47] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[01:22] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[01:27] *** Start has quit (Disconnected.)
[01:42] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[01:43] *** svchfoo2 gives channel operator status to Start
[02:01] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[02:04] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[03:21] they should be in the json.
[03:21] they are not?
[03:23] no, I don't think any of the files have a sha listed. only md5
[03:33] ooo
[03:34] ok. csv of just the 22 million then.
[03:38] ok, i can get that tomorrow
[03:51] Thanks!
[05:15] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[05:17] *** wp494 has quit (Ping timeout: 740 seconds)
[05:38] *** wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
[06:58] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[07:00] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[07:06] DFJustin: my current best guess is git-annex will be told the urls on archive.org and not see the file contents until they are downloaded into individuals' drives, so no deduplication in that case
[07:06] except whatever the sha1s tell us
[07:08] sep332: a list of files in a form like this would let me start generating test git-annex repos: sha1 size collection item file url
[07:08] (CSV or something)
[07:16] *** db48x has quit (Read error: Operation timed out)
[07:31] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[08:16] *** londoncal has quit (Quit: Leaving...)
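[Editor's note: the "sha1 size collection item file url" list requested above could look like the following as CSV. Every value here is a hypothetical placeholder (the sha1 is the well-known empty-input digest), not real census data:]

```csv
sha1,size,collection,item,file,url
da39a3ee5e6b4b0d3255bfef95601890afd80709,4096,newsandpublicaffairs,Urdu-Trana-001,01-track.mp3,https://archive.org/download/Urdu-Trana-001/01-track.mp3
```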
[09:14] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[09:15] *** thunk has quit (Client Quit)
[09:56] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[09:59] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[10:25] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[12:40] *** VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
[13:15] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[13:16] *** thunk has quit (Client Quit)
[13:16] *** zottelbey has quit (Ping timeout: 512 seconds)
[13:17] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[13:22] I had a chat about that websrv-file by the way.
[13:24] It turns out they intentionally dupe websrv-XXXX files because they're critical indexes.
[13:25] oh ok. well that's good
[13:29] closure: some of these items have multiple collections listed.
[13:29] Right. Obviously, our backup might turn off half of them.
[13:29] for example the first one in the file: item "Urdu-Trana-001" is in "iraq_middleeast","iraq_war" and "newsandpublicaffairs"
[13:29] but they'll all be websrv-XXXX-0 and -1
[13:29] If you have time, it'd be neat to see how much space that is (and if they match up)
[13:34] *** sep332 (~sep332@[redacted]) has left #internetarchive.bak
[13:34] *** sep332 (~sep332@[redacted]) has joined #internetarchive.bak
[13:34] *** svchfoo2 gives channel operator status to sep332
[13:39] sep332: multiple columns for anything except the sha1 will be fine
[14:00] *** bzc6p__ is now known as bzc6p
[14:06] i'm having trouble with jq syntax
[14:07] i can get the attributes of each file on a line, or each item on a line
[14:07] but not both
[14:18] *** Start has quit (Disconnected.)
[14:44] ok I got the syntax.
it's just running very slowly
[14:51] *** Laverne (~Laverne@[redacted]) has joined #internetarchive.bak
[15:03] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[15:09] *** caber (~caber@[redacted]) has joined #internetarchive.bak
[15:10] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[15:30] *** tephra (~tephra@[redacted]) has left #internetarchive.bak
[15:39] geez, 120GB and still going
[15:41] what're you working on?
[15:41] looking for duplicate files in the census
[15:41] :)
[15:43] dupe items will be pretty large
[15:47] it's funny, my intermediate files are repeating the item name and collections for each file, instead of once per item
[15:47] so i'm creating a lot of redundant data in order to look for redundant data
[15:51] *** Start has quit (Disconnected.)
[15:57] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[15:58] *** Start has quit (Read error: Connection reset by peer)
[15:58] *** johtso (uid563@[redacted]) has joined #internetarchive.bak
[15:58] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[16:20] ah and my syntax was wrong again. that's why the files were so big.
[16:20] that's good, i didn't want to run "sort" on 200GB files!
[16:27] *** patrickod (~patrickod@[redacted]) has joined #internetarchive.bak
[16:45] *** Start has quit (Disconnected.)
[16:53] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[17:16] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[17:17] *** thunk has quit (Client Quit)
[17:30] phew, data extracted, sort commenced. 33.2GB this time, much better!
[17:34] *** Start has quit (Read error: Connection reset by peer)
[17:34] *** Start_ (~Start@[redacted]) has joined #internetarchive.bak
[17:35] *** Start_ is now known as Start
[17:43] *** Start has quit (Disconnected.)
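[Editor's note: the jq problem described above — item-level and file-level attributes on the same output line — is usually solved by binding the item to a variable before exploding the files array. This is only a sketch: the field names (`identifier`, `collection`, `files`, `name`, `md5`, `size`) are assumptions about the census JSON schema, and `census.json` is a placeholder filename.]

```shell
# Emit one CSV line per file, carrying the parent item's fields along.
# ". as $item" saves the whole item record before ".files[]" iterates,
# so both levels are in scope on every output line.
# (Assumes .collection is an array; it may be a plain string in some records.)
jq -r '. as $item
       | .files[]
       | [.md5, .size, ($item.collection | join(";")), $item.identifier, .name]
       | @csv' census.json > files.csv
```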
[18:42] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:44] *** Start has quit (Read error: Connection reset by peer)
[18:44] *** Start_ (~Start@[redacted]) has joined #internetarchive.bak
[19:02] *** Start_ has quit (Disconnected.)
[19:17] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:20] *** zottelbey has quit (Ping timeout: 512 seconds)
[19:21] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[19:27] *** Start has quit (Disconnected.)
[19:32] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:17] *** Start has quit (Disconnected.)
[21:02] sort and uniq done! 32 million lines, 4.1 GB
[21:02] note to self: never use pv with the -l flag, it's super-slow
[21:04] for super long lines yeah
[21:07] oh is that what it is? I didn't think these lines were that long but it was using 100% cpu for two hours :p
[21:08] it averages 135 characters per line
[21:11] oh hmm
[21:11] odd, pv -l is usually fast enough for me
[21:11] maybe we have different requirements
[21:12] i want it to be transparent - if it's slower than doing actual work, it's too slow :)
[21:13] it couldn't feel uniq fast enough. which is admittedly a tough job.
[21:14] I often don't feel uniq
[21:16] lol i meant feed
[21:17] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[21:18] *** thunk has quit (Client Quit)
[21:52] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[21:57] closure: https://www.dropbox.com/sh/u6trh0ldjuj3k3i/AAD4XDhJG6nhQBW11Q3Gy-FNa?dl=0
[21:57] it's the big one lol
[21:58] *** bzc6p has quit (Ping timeout: 600 seconds)
[21:59] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[22:00] afk for an hour but let me know how it looks
[22:35] sep332, why did you use md5 and not sha1 (possible hash collisions)?
[22:50] I don't have the sha1. the census file only has md5
[22:51] anyway i doubt there are collisions unless someone was really trying to make a collision.
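[Editor's note: a sketch of the kind of sort/uniq duplicate-finding pass described above, not the exact commands used in the log. It assumes each line starts with the 32-character hex md5; `files.csv` and the output names are placeholders.]

```shell
# Byte-wise collation (LC_ALL=C) is much faster than locale-aware
# sorting, and "sort -u" folds the separate uniq pass into the sort.
# Plain pv reports byte throughput; "pv -l" counts lines, which is the
# slow mode complained about above.
LC_ALL=C sort -u files.csv > files.sorted.csv

# Duplicate contents then show up as repeated hash prefixes:
# -w32 compares only the first 32 characters (the md5), and
# --all-repeated=separate prints every member of each duplicate group,
# blank-line separated (GNU uniq options).
uniq -w32 --all-repeated=separate files.sorted.csv > dupes.txt
```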
md5 is only weak if you're attacking it
[23:09] I ran the file through sort, but it came out scrambled anyway. i could not figure out what it sorted on
[23:09] turns out it's a numeric sort of the filename. d'oh!
[23:24] *** X-Scale has quit (Ping timeout: 240 seconds)
[23:33] *** niyaje has quit (Ping timeout: 600 seconds)
[23:47] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
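[Editor's note: the surprise ordering described above is a common sort pitfall — without an explicit key, field separator, and locale, the collation can look "scrambled". A hedged sketch, assuming a comma-separated file with the hash in column 1 (the layout and filenames are assumptions):]

```shell
# Pin everything down: byte collation, comma separator, key = field 1
# only (-k1,1). Without the trailing ",1" the key would extend to the
# end of the line, which is one way to get an unexpected order.
LC_ALL=C sort -t, -k1,1 files.csv > files.by-hash.csv
```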