[00:06] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[00:08] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[00:29] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[01:29] *** VADemon has quit (Quit: left4dead)
[01:59] *** zottelbey has quit (Remote host closed the connection)
[06:54] *** X-Scale has quit (Ping timeout: 240 seconds)
[08:29] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[08:30] *** thunk has quit (Client Quit)
[09:18] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[09:24] *** bzc6p has quit (Ping timeout: 600 seconds)
[10:13] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[11:32] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[11:44] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[12:30] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[12:31] *** thunk has quit (Client Quit)
[12:43] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[12:56] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[14:09] *** londoncal has quit (Leaving...)
[14:28] *** Start has quit (Disconnected.)
[15:20] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[15:51] *** Start has quit (Disconnected.)
[16:15] *** bzc6p_ is now known as bzc6p
[16:56] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[16:57] *** thunk has quit (Client Quit)
[17:05] *** godane has quit (Quit: Leaving.)
[17:07] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[17:40] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[17:43] *** Start has quit (Client Quit)
[17:48] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:37] *** Start has quit (Disconnected.)
[18:45] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:46] *** Start has quit (Read error: Connection reset by peer)
[18:46] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:01] *** Start has quit (Disconnected.)
[19:17] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:18] *** Start has quit (Read error: Connection reset by peer)
[19:30] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:31] *** Start has quit (Read error: Connection reset by peer)
[19:36] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:41] *** londoncal has quit (Remote host closed the connection)
[20:08] SketchCow: awesome work on the census
[20:09] especially interesting about the 1pb dups
[20:09] totally worth filtering those out
[20:10] (erm, assuming a non-malicious md5 collision in this many files is very unlikely, I've not done the math)
[20:10] i added the bit about dupes. i thought it might help the backup, but at 1PB, it might even be worth IA's time to look into
[20:11] i'm assuming that someone has made an item full of MD5 collisions just because those are cool, and not maliciously :)
[20:12] but I'm also assuming that those are small and don't affect the census too much
[20:18] *** Start has quit (Read error: Connection reset by peer)
[20:18] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:18] *** Sanqui has quit (west.us.hub irc.mzima.net)
[20:20] *** Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
[20:22] Closure! How's the rouging.
[20:22] *** Start has quit (Client Quit)
[20:24] 2 days out and going amazing. http://scroll.joeyh.name:4242/
[20:25] Naturally, I will ride you like a glue horse when you get back.
[20:25] should have a little time this WE
[20:26] Well, the census should be a good start.
[20:26] We can make good test choices.
[20:27] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:35] I suspect that maybe we should have a list called CANTEVEN.txt
[20:35] In it, it's a list of items and why they shouldn't be in a backup.
[20:35] (Duplicate of XXXXX, etc.)
[20:51] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[20:57] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[20:58] *** thunk has quit (Client Quit)
[21:00] i have a list of 22 million duplicate files. would it be more useful to see this at the level of items instead?
[21:19] *** Start has quit (Disconnected.)
[21:32] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[21:32] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[21:38] *** bzc6p has quit (Read error: Operation timed out)
[21:56] I'm not super up on how git-annex works but it might actually handle the duplicate files automagically without having to do anything special
[21:57] provided they end up in the same shard I guess
[22:21] If they're 100% dups, it certainly will.
[22:21] So long as they're in the same shard.
[22:43] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[22:50] *** goekesmi (~goekesmi@[redacted]) has joined #internetarchive.bak
[22:51] *** garyrh gives channel operator status to bzc6p_ chfoo closure Ctrl-S
[22:51] *** garyrh gives channel operator status to midas sep332
[22:52] *** zottelbey has quit (Remote host closed the connection)
[22:53] oh, interesting
[22:55] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[23:05] we're not really going to know how big a shard is ahead of time, huh :-/
[23:30] sep332
[23:31] could you compare md5 and sha1 of a file?
[23:31] that wjll show trje dypes
[23:31] that will show true dupes.
[23:31] then we should make a csv
[23:34] i don't have the actual files. do you have sha-1's of the files
[23:34] ?
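The md5-plus-sha1 check suggested above can be sketched roughly as follows. This is a minimal Python sketch, assuming the census can be exported as (path, md5, sha1) records; the function names, the sample records, and the CSV columns are hypothetical and not taken from any actual census tooling.

```python
import csv
import io
from collections import defaultdict


def find_true_dupes(records):
    """Split duplicate candidates into true dupes and md5-only matches.

    records: iterable of (path, md5, sha1) tuples (hypothetical census export).
    Returns (true_dupes, md5_only):
      true_dupes maps (md5, sha1) -> sorted paths, for groups of 2+ files;
      md5_only lists md5 values seen with more than one sha1, i.e. files
      that share an md5 but are not byte-identical (possible collisions).
    """
    paths_by_both = defaultdict(list)
    sha1s_by_md5 = defaultdict(set)
    for path, md5, sha1 in records:
        paths_by_both[(md5, sha1)].append(path)
        sha1s_by_md5[md5].add(sha1)
    true_dupes = {
        key: sorted(paths)
        for key, paths in paths_by_both.items()
        if len(paths) > 1
    }
    md5_only = sorted(m for m, s in sha1s_by_md5.items() if len(s) > 1)
    return true_dupes, md5_only


def dupes_to_csv(true_dupes):
    """Render the true-dupe groups as CSV text, one row per group."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["md5", "sha1", "count", "paths"])
    for (md5, sha1), paths in sorted(true_dupes.items()):
        writer.writerow([md5, sha1, len(paths), ";".join(paths)])
    return out.getvalue()


# Made-up sample records; real hashes would be hex digests.
records = [
    ("item1/a.txt", "m1", "s1"),
    ("item2/b.txt", "m1", "s1"),  # same md5 and sha1: true dupe of item1/a.txt
    ("item3/c.bin", "m2", "s2"),
    ("item4/d.bin", "m2", "s3"),  # same md5, different sha1: md5-only match
    ("item5/e.iso", "m3", "s4"),
]
true_dupes, md5_only = find_true_dupes(records)
csv_text = dupes_to_csv(true_dupes)
```

Grouping on the pair rather than on md5 alone means a deliberately constructed md5 collision would land in md5_only instead of being counted as a duplicate.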
[23:45] *** X-Scale has quit (Ping timeout: 240 seconds)
[23:51] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[23:51] *** svchfoo2 gives channel operator status to Start