[00:06] *** londoncal has quit (Remote host closed the connection)
[00:31] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[00:37] *** Start has quit (Disconnected.)
[00:47] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[01:22] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[01:27] *** Start has quit (Disconnected.)
[01:42] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[01:43] *** svchfoo2 gives channel operator status to Start
[02:01] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[02:04] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[03:21] they should be in the json.
[03:21] they are not?
[03:23] no, I don't think any of the files have a sha listed. only md5
[03:33] ooo
[03:34] ok. csv of just the 22 million then.
[03:38] ok, i can get that tomorrow
[03:51] Thanks!
[05:15] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[05:17] *** wp494 has quit (Ping timeout: 740 seconds)
[05:38] *** wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
[06:58] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[07:00] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[07:06] DFJustin: my current best guess is git-annex will be told the urls on archive.org and not see the file contents until they are downloaded into individuals' drives, so no deduplication in that case
[07:06] except whatever the sha1s tell us
[07:08] sep332: a list of files in a form like this would let me start generating test git-annex repos: sha1 size collection item file url
[07:08] (CSV or something)
[07:16] *** db48x has quit (Read error: Operation timed out)
[07:31] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[08:16] *** londoncal has quit (Quit: Leaving...)
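[Editor's note: the "sha1 size collection item file url" list requested above could look like the following as CSV. Every value here is a hypothetical placeholder (the sha1 is the well-known empty-input digest), not real census data:]

```csv
sha1,size,collection,item,file,url
da39a3ee5e6b4b0d3255bfef95601890afd80709,4096,newsandpublicaffairs,Urdu-Trana-001,01-track.mp3,https://archive.org/download/Urdu-Trana-001/01-track.mp3
```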
[09:14] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[09:15] *** thunk has quit (Client Quit)
[09:56] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[09:59] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[10:25] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[12:40] *** VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
[13:15] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[13:16] *** thunk has quit (Client Quit)
[13:16] *** zottelbey has quit (Ping timeout: 512 seconds)
[13:17] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[13:22] I had a chat about that websrv-file by the way.
[13:24] It turns out they intentionally dupe websrv-XXXX files because they're critical indexes.
[13:25] oh ok. well that's good
[13:29] closure: some of these items have multiple collections listed.
[13:29] Right. Obviously, our backup might turn off half of them.
[13:29] for example the first one in the file: item "Urdu-Trana-001" is in "iraq_middleeast","iraq_war" and "newsandpublicaffairs"
[13:29] but they'll all be websrv-XXXX-0 and -1
[13:29] If you have time, it'd be neat to see how much space that is (and if they match up)
[13:34] *** sep332 (~sep332@[redacted]) has left #internetarchive.bak
[13:34] *** sep332 (~sep332@[redacted]) has joined #internetarchive.bak
[13:34] *** svchfoo2 gives channel operator status to sep332
[13:39] sep332: multiple columns for anything except the sha1 will be fine
[14:00] *** bzc6p__ is now known as bzc6p
[14:06] i'm having trouble with jq syntax
[14:07] i can get the attributes of each file on a line, or each item on a line
[14:07] but not both
[14:18] *** Start has quit (Disconnected.)
[14:44] ok I got the syntax.
it's just running very slowly
[14:51] *** Laverne (~Laverne@[redacted]) has joined #internetarchive.bak
[15:03] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[15:09] *** caber (~caber@[redacted]) has joined #internetarchive.bak
[15:10] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[15:30] *** tephra (~tephra@[redacted]) has left #internetarchive.bak
[15:39] geez, 120GB and still going
[15:41] what're you working on?
[15:41] looking for duplicate files in the census
[15:41] :)
[15:43] dupe items will be pretty large
[15:47] it's funny, my intermediate files are repeating the item name and collections for each file, instead of once per item
[15:47] so i'm creating a lot of redundant data in order to look for redundant data
[15:51] *** Start has quit (Disconnected.)
[15:57] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[15:58] *** Start has quit (Read error: Connection reset by peer)
[15:58] *** johtso (uid563@[redacted]) has joined #internetarchive.bak
[15:58] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[16:20] ah and my syntax was wrong again. that's why the files were so big.
[16:20] that's good, i didn't want to run "sort" on 200GB files!
[16:27] *** patrickod (~patrickod@[redacted]) has joined #internetarchive.bak
[16:45] *** Start has quit (Disconnected.)
[16:53] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[17:16] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[17:17] *** thunk has quit (Client Quit)
[17:30] phew, data extracted, sort commenced. 33.2GB this time, much better!
[17:34] *** Start has quit (Read error: Connection reset by peer)
[17:34] *** Start_ (~Start@[redacted]) has joined #internetarchive.bak
[17:35] *** Start_ is now known as Start
[17:43] *** Start has quit (Disconnected.)
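[Editor's note: the jq problem described above — item-level and file-level attributes on the same output line — is usually solved by binding the item to a variable before exploding the files array. This is only a sketch: the field names (`identifier`, `collection`, `files`, `name`, `md5`, `size`) are assumptions about the census JSON schema, and `census.json` is a placeholder filename.]

```shell
# Emit one CSV line per file, carrying the parent item's fields along.
# ". as $item" saves the whole item record before ".files[]" iterates,
# so both levels are in scope on every output line.
# (Assumes .collection is an array; it may be a plain string in some records.)
jq -r '. as $item
       | .files[]
       | [.md5, .size, ($item.collection | join(";")), $item.identifier, .name]
       | @csv' census.json > files.csv
```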
[18:42] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:44] *** Start has quit (Read error: Connection reset by peer)
[18:44] *** Start_ (~Start@[redacted]) has joined #internetarchive.bak
[19:02] *** Start_ has quit (Disconnected.)
[19:17] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:20] *** zottelbey has quit (Ping timeout: 512 seconds)
[19:21] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[19:27] *** Start has quit (Disconnected.)
[19:32] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:17] *** Start has quit (Disconnected.)
[21:02] sort and uniq done! 32 million lines, 4.1 GB
[21:02] note to self: never use pv with the -l flag, it's super-slow
[21:04] for super long lines yeah
[21:07] oh is that what it is? I didn't think these lines were that long but it was using 100% cpu for two hours :p
[21:08] it averages 135 characters per line
[21:11] oh hmm
[21:11] odd, pv -l is usually fast enough for me
[21:11] maybe we have different requirements
[21:12] i want it to be transparent - if it's slower than doing actual work, it's too slow :)
[21:13] it couldn't feel uniq fast enough. which is admittedly a tough job.
[21:14] I often don't feel uniq
[21:16] lol i meant feed
[21:17] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[21:18] *** thunk has quit (Client Quit)
[21:52] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[21:57] closure: https://www.dropbox.com/sh/u6trh0ldjuj3k3i/AAD4XDhJG6nhQBW11Q3Gy-FNa?dl=0
[21:57] it's the big one lol
[21:58] *** bzc6p has quit (Ping timeout: 600 seconds)
[21:59] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[22:00] afk for an hour but let me know how it looks
[22:35] sep332, why did you use md5 and not sha1 (possible hash collisions)?
[22:50] I don't have the sha1. the census file only has md5
[22:51] anyway i doubt there are collisions unless someone was really trying to make a collision.
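[Editor's note: a sketch of the kind of sort/uniq duplicate-finding pass described above, not the exact commands used in the log. It assumes each line starts with the 32-character hex md5; `files.csv` and the output names are placeholders.]

```shell
# Byte-wise collation (LC_ALL=C) is much faster than locale-aware
# sorting, and "sort -u" folds the separate uniq pass into the sort.
# Plain pv reports byte throughput; "pv -l" counts lines, which is the
# slow mode complained about above.
LC_ALL=C sort -u files.csv > files.sorted.csv

# Duplicate contents then show up as repeated hash prefixes:
# -w32 compares only the first 32 characters (the md5), and
# --all-repeated=separate prints every member of each duplicate group,
# blank-line separated (GNU uniq options).
uniq -w32 --all-repeated=separate files.sorted.csv > dupes.txt
```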
md5 is only weak if you're attacking it
[23:09] I ran the file through sort, but it came out scrambled anyway. i could not figure out what it sorted on
[23:09] turns out it's a numeric sort of the filename. d'oh!
[23:24] *** X-Scale has quit (Ping timeout: 240 seconds)
[23:33] *** niyaje has quit (Ping timeout: 600 seconds)
[23:47] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
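[Editor's note: the surprise ordering described above is a common sort pitfall — without an explicit key, field separator, and locale, the collation can look "scrambled". A hedged sketch, assuming a comma-separated file with the hash in column 1 (the layout and filenames are assumptions):]

```shell
# Pin everything down: byte collation, comma separator, key = field 1
# only (-k1,1). Without the trailing ",1" the key would extend to the
# end of the line, which is one way to get an unexpected order.
LC_ALL=C sort -t, -k1,1 files.csv > files.by-hash.csv
```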