#internetarchive.bak 2015-03-12,Thu

Time Nickname Message
00:06 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
00:08 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
00:29 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
01:29 🔗 VADemon has quit (Quit: left4dead)
01:59 🔗 zottelbey has quit (Remote host closed the connection)
06:54 🔗 X-Scale has quit (Ping timeout: 240 seconds)
08:29 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
08:30 🔗 thunk has quit (Client Quit)
09:18 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
09:24 🔗 bzc6p has quit (Ping timeout: 600 seconds)
10:13 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
11:32 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
11:44 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
12:30 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
12:31 🔗 thunk has quit (Client Quit)
12:43 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
12:56 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
14:09 🔗 londoncal has quit (Leaving...)
14:28 🔗 Start has quit (Disconnected.)
15:20 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
15:51 🔗 Start has quit (Disconnected.)
16:15 🔗 bzc6p_ is now known as bzc6p
16:56 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
16:57 🔗 thunk has quit (Client Quit)
17:05 🔗 godane has quit (Quit: Leaving.)
17:07 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
17:40 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
17:43 🔗 Start has quit (Client Quit)
17:48 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:37 🔗 Start has quit (Disconnected.)
18:45 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:46 🔗 Start has quit (Read error: Connection reset by peer)
18:46 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:01 🔗 Start has quit (Disconnected.)
19:17 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:18 🔗 Start has quit (Read error: Connection reset by peer)
19:30 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:31 🔗 Start has quit (Read error: Connection reset by peer)
19:36 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:41 🔗 londoncal has quit (Remote host closed the connection)
20:08 🔗 closure SketchCow: awesome work on the census
20:09 🔗 closure especially interesting about the 1pb dups
20:09 🔗 closure totally worth filtering those out
20:10 🔗 closure (erm, assuming a non-malicious md5 collision in this many files is very unlikely, I've not done the math)
20:10 🔗 sep332 i added the bit about dupes. i thought it might help the backup, but at 1PB, it might even be worth IA's time to look into
20:11 🔗 sep332 i'm assuming that someone has made an item full of MD5 collisions just because those are cool, and not maliciously :)
20:12 🔗 sep332 but I'm also assuming that those are small and don't affect the census too much
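[Editor's note] closure mentions not having done the math on accidental MD5 collisions. A rough sketch of that estimate follows, using the standard union (birthday) bound for a 128-bit hash. The file count of one billion is a made-up placeholder, not a figure from the census, and the bound covers only accidental collisions, not deliberately crafted ones like the item sep332 describes.

```python
def md5_collision_upper_bound(num_files: int, hash_bits: int = 128) -> float:
    """Upper bound on the chance that any two of num_files distinct files
    share a hash by accident, assuming the hash behaves like a uniform
    random function: P <= C(n, 2) / 2**hash_bits (union bound)."""
    pairs = num_files * (num_files - 1) // 2  # number of unordered file pairs
    return pairs / 2 ** hash_bits


# Hypothetical file count, chosen only for illustration.
ASSUMED_FILE_COUNT = 10 ** 9

if __name__ == "__main__":
    p = md5_collision_upper_bound(ASSUMED_FILE_COUNT)
    print(f"upper bound for {ASSUMED_FILE_COUNT} files: {p:.3e}")
```

Under these assumptions the bound is far below any practical concern, which matches closure's intuition that a non-malicious collision is very unlikely.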
20:18 🔗 Start has quit (Read error: Connection reset by peer)
20:18 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:18 🔗 Sanqui has quit (west.us.hub irc.mzima.net)
20:20 🔗 Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
20:22 🔗 SketchCow Closure! How's the rouging.
20:22 🔗 Start has quit (Client Quit)
20:24 🔗 closure 2 days out and going amazing. http://scroll.joeyh.name:4242/
20:25 🔗 SketchCow Naturally, I will ride you like a glue horse when you get back.
20:25 🔗 closure should have a little time this WE
20:26 🔗 SketchCow Well, the census should be a good start.
20:26 🔗 SketchCow We can make good test choices.
20:27 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:35 🔗 SketchCow I suspect that maybe we should have a list called CANTEVEN.txt
20:35 🔗 SketchCow In it, it's a list of items and why they shouldn't be in a backup.
20:35 🔗 SketchCow (Duplicate of XXXXX, etc.)
20:51 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
20:57 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
20:58 🔗 thunk has quit (Client Quit)
21:00 🔗 sep332 i have a list of 22 million duplicate files. would it be more useful to see this at the level of items instead?
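[Editor's note] One way sep332's file-level list could be rolled up to the item level is sketched below. The record layout `(item, filename, md5, size_bytes)` and the function name are assumptions made for illustration; the actual census format is not shown in this log. The sketch treats the first file seen with a given MD5 as the copy to keep and counts every later file with the same MD5 as redundant, so the result depends on input order.

```python
from collections import defaultdict


def redundant_files_per_item(records):
    """Return {item: (redundant_file_count, redundant_bytes)}.

    records: iterable of (item, filename, md5, size_bytes) tuples
    (hypothetical layout). Items with no redundant files are omitted.
    """
    seen_md5 = set()
    totals = defaultdict(lambda: [0, 0])
    for item, _filename, md5, size_bytes in records:
        if md5 in seen_md5:
            # A file with this digest was already counted as the kept copy.
            totals[item][0] += 1
            totals[item][1] += size_bytes
        else:
            seen_md5.add(md5)
    return {item: (count, size) for item, (count, size) in totals.items()}


if __name__ == "__main__":
    sample = [
        ("item_a", "x.bin", "m1", 10),
        ("item_a", "y.bin", "m2", 5),
        ("item_b", "z.bin", "m1", 10),
        ("item_b", "w.bin", "m1", 10),
    ]
    print(redundant_files_per_item(sample))
```

An item whose files are all redundant would be a natural candidate for the kind of "Duplicate of XXXXX" entry SketchCow describes for CANTEVEN.txt.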
21:19 🔗 Start has quit (Disconnected.)
21:32 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
21:32 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
21:38 🔗 bzc6p has quit (Read error: Operation timed out)
21:56 🔗 DFJustin I'm not super up on how git-annex works but it might actually handle the duplicate files automagically without having to do anything special
21:57 🔗 DFJustin provided they end up in the same shard I guess
22:21 🔗 pikhq If they're 100% dups, it certainly will.
22:21 🔗 pikhq So long as they're in the same shard.
22:43 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
22:50 🔗 goekesmi (~goekesmi@[redacted]) has joined #internetarchive.bak
22:51 🔗 garyrh gives channel operator status to bzc6p_ chfoo closure Ctrl-S
22:51 🔗 garyrh gives channel operator status to midas sep332
22:52 🔗 zottelbey has quit (Remote host closed the connection)
22:53 🔗 sep332 oh, interesting
22:55 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
23:05 🔗 sep332 we're not really going to know how big a shard is ahead of time, huh :-/
23:30 🔗 SketchCow sep332
23:31 🔗 SketchCow could you compare md5 and sha1 of a file?
23:31 🔗 SketchCow that will show true dupes.
23:31 🔗 SketchCow then we should make a csv
23:34 🔗 sep332 i don't have the actual files. do you have sha-1's of the files
23:34 🔗 sep332 ?
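[Editor's note] SketchCow's suggestion can be sketched as follows, working from digest lists alone since sep332 does not have the files themselves. Files that agree on both MD5 and SHA-1 are marked as true dupes, while files that match others on MD5 only are flagged separately. The record layout `(item, filename, md5, sha1)`, the column names, and the function names are assumptions for illustration, and the sketch presumes SHA-1 values are available, which the log leaves open.

```python
import csv
import io
from collections import defaultdict


def classify_dupes(records):
    """Group records by MD5, then by SHA-1 within each MD5 group.

    records: iterable of (item, filename, md5, sha1) tuples (hypothetical).
    Returns rows of (md5, sha1, kind, item, filename) for every file whose
    MD5 is shared with at least one other file.
    """
    by_md5 = defaultdict(lambda: defaultdict(list))
    for item, filename, md5, sha1 in records:
        by_md5[md5][sha1].append((item, filename))

    rows = []
    for md5, by_sha1 in by_md5.items():
        if sum(len(files) for files in by_sha1.values()) < 2:
            continue  # this MD5 is unique, nothing to report
        for sha1, files in by_sha1.items():
            # Agreement on both digests: treat as a true duplicate.
            kind = "true_dupe" if len(files) > 1 else "md5_only"
            for item, filename in files:
                rows.append((md5, sha1, kind, item, filename))
    return rows


def write_dupes_csv(rows, out):
    """Write the classified rows to an open text stream as CSV."""
    writer = csv.writer(out)
    writer.writerow(["md5", "sha1", "kind", "item", "file"])
    writer.writerows(rows)


if __name__ == "__main__":
    sample = [
        ("i1", "a.bin", "m1", "s1"),
        ("i2", "b.bin", "m1", "s1"),
        ("i3", "c.bin", "m1", "s2"),
        ("i4", "d.bin", "m2", "s3"),
    ]
    buf = io.StringIO()
    write_dupes_csv(classify_dupes(sample), buf)
    print(buf.getvalue())
```

Rows marked `md5_only` would be the ones worth a closer look, since they could be either data errors or deliberate MD5 collisions of the kind sep332 mentions.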
23:45 🔗 X-Scale has quit (Ping timeout: 240 seconds)
23:51 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:51 🔗 svchfoo2 gives channel operator status to Start
