Time |
Nickname |
Message |
00:06
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
00:08
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
00:29
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
01:29
🔗
|
|
VADemon has quit (Quit: left4dead) |
01:59
🔗
|
|
zottelbey has quit (Remote host closed the connection) |
06:54
🔗
|
|
X-Scale has quit (Ping timeout: 240 seconds) |
08:29
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
08:30
🔗
|
|
thunk has quit (Client Quit) |
09:18
🔗
|
|
bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak |
09:24
🔗
|
|
bzc6p has quit (Ping timeout: 600 seconds) |
10:13
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
11:32
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
11:44
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
12:30
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
12:31
🔗
|
|
thunk has quit (Client Quit) |
12:43
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
12:56
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
14:09
🔗
|
|
londoncal has quit (Leaving...) |
14:28
🔗
|
|
Start has quit (Disconnected.) |
15:20
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
15:51
🔗
|
|
Start has quit (Disconnected.) |
16:15
🔗
|
|
bzc6p_ is now known as bzc6p |
16:56
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
16:57
🔗
|
|
thunk has quit (Client Quit) |
17:05
🔗
|
|
godane has quit (Quit: Leaving.) |
17:07
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
17:40
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
17:43
🔗
|
|
Start has quit (Client Quit) |
17:48
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
18:37
🔗
|
|
Start has quit (Disconnected.) |
18:45
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
18:46
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
18:46
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
19:01
🔗
|
|
Start has quit (Disconnected.) |
19:17
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
19:18
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
19:30
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
19:31
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
19:36
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
19:41
🔗
|
|
londoncal has quit (Remote host closed the connection) |
20:08
🔗
|
closure |
SketchCow: awesome work on the census |
20:09
🔗
|
closure |
especially interesting about the 1pb dups |
20:09
🔗
|
closure |
totally worth filtering those out |
20:10
🔗
|
closure |
(erm, assuming a non-malicious md5 collision in this many files is very unlikely, I've not done the math) |
20:10
🔗
|
sep332 |
i added the bit about dupes. i thought it might help the backup, but at 1PB, it might even be worth IA's time to look into |
20:11
🔗
|
sep332 |
i'm assuming that someone has made an item full of MD5 collisions just because those are cool, and not maliciously :) |
20:12
🔗
|
sep332 |
but I'm also assuming that those are small and don't affect the census too much |
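closure's aside about accidental MD5 collisions ("I've not done the math") can be roughed out with the standard birthday-bound approximation. The sketch below is only that: it assumes MD5 behaves like a uniform random 128-bit function for non-adversarial inputs, and the file count passed in is a hypothetical figure, since the log does not give the census total. It says nothing about deliberately constructed collisions, which sep332 raises separately.

```python
import math


def accidental_collision_probability(n_files: int, hash_bits: int = 128) -> float:
    """Birthday-bound estimate of at least one accidental hash collision.

    Uses p ~= 1 - exp(-n(n-1) / 2^(bits+1)), assuming hashes are uniformly
    distributed. This models non-malicious files only.
    """
    exponent = n_files * (n_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # Hypothetical file count, chosen only for illustration.
    print(accidental_collision_probability(10**9))
```

For a hypothetical billion files the estimate is on the order of 1e-21, so under these assumptions matching MD5s among ordinary files are overwhelmingly likely to be real duplicates.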
20:18
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
20:18
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
20:18
🔗
|
|
Sanqui has quit (west.us.hub irc.mzima.net) |
20:20
🔗
|
|
Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak |
20:22
🔗
|
SketchCow |
Closure! How's the rouging. |
20:22
🔗
|
|
Start has quit (Client Quit) |
20:24
🔗
|
closure |
2 days out and going amazing. http://scroll.joeyh.name:4242/ |
20:25
🔗
|
SketchCow |
Naturally, I will ride you like a glue horse when you get back. |
20:25
🔗
|
closure |
should have a little time this WE |
20:26
🔗
|
SketchCow |
Well, the census should be a good start. |
20:26
🔗
|
SketchCow |
We can make good test choices. |
20:27
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
20:35
🔗
|
SketchCow |
I suspect that maybe we should have a list called CANTEVEN.txt |
20:35
🔗
|
SketchCow |
In it, it's a list of items and why they shouldn't be in a backup. |
20:35
🔗
|
SketchCow |
(Duplicate of XXXXX, etc.) |
20:51
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
20:57
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
20:58
🔗
|
|
thunk has quit (Client Quit) |
21:00
🔗
|
sep332 |
i have a list of 22 million duplicate files. would it be more useful to see this at the level of items instead? |
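One way to read sep332's question is as a roll-up from file-level duplicates to whole items, which is also the granularity SketchCow's CANTEVEN.txt idea works at. The sketch below assumes the census can be reduced to `(item, md5)` pairs; that input shape and the function name are hypothetical, not something taken from the census tooling. It reports each item whose entire set of file hashes also appears inside some other item.

```python
from collections import defaultdict


def items_covered_by_others(file_rows):
    """Map each item to the other items that contain all of its file hashes.

    file_rows: iterable of (item, md5) pairs, one per file (assumed format).
    """
    hashes_by_item = defaultdict(set)
    items_by_hash = defaultdict(set)
    for item, md5 in file_rows:
        hashes_by_item[item].add(md5)
        items_by_hash[md5].add(item)

    covered = {}
    for item, hashes in hashes_by_item.items():
        # Items that hold every one of this item's hashes, excluding itself.
        holders = set.intersection(*(items_by_hash[h] for h in hashes)) - {item}
        if holders:
            covered[item] = sorted(holders)
    return covered
```

Two items with identical contents will each be listed as covered by the other, so any exclusion list built from this would still need a rule for which copy to keep.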
21:19
🔗
|
|
Start has quit (Disconnected.) |
21:32
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
21:32
🔗
|
|
bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak |
21:38
🔗
|
|
bzc6p has quit (Read error: Operation timed out) |
21:56
🔗
|
DFJustin |
I'm not super up on how git-annex works but it might actually handle the duplicate files automagically without having to do anything special |
21:57
🔗
|
DFJustin |
provided they end up in the same shard I guess |
22:21
🔗
|
pikhq |
If they're 100% dups, it certainly will. |
22:21
🔗
|
pikhq |
So long as they're in the same shard. |
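The "same shard" caveat from DFJustin and pikhq can be illustrated with a toy content-addressed store. This is not git-annex's implementation, only a model of the idea under discussion: within one shard an object is keyed by its content hash and stored once, while separate shards know nothing about each other. The `shard_of` callback and the use of SHA-256 as the key are assumptions made for the sketch.

```python
import hashlib


def stored_bytes(files, shard_of):
    """Total bytes kept when each shard deduplicates by content hash.

    files: iterable of (item, content) pairs, with content as bytes.
    shard_of: hypothetical function mapping an item name to a shard id.
    """
    shards = {}
    for item, content in files:
        key = hashlib.sha256(content).hexdigest()
        # Identical content in the same shard overwrites the same key.
        shards.setdefault(shard_of(item), {})[key] = len(content)
    return sum(sum(objects.values()) for objects in shards.values())
```

With two items holding the same 1000-byte file, this model stores 1000 bytes if both items map to one shard and 2000 bytes if they map to different shards.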
22:43
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
22:50
🔗
|
|
goekesmi (~goekesmi@[redacted]) has joined #internetarchive.bak |
22:51
🔗
|
|
garyrh gives channel operator status to bzc6p_ chfoo closure Ctrl-S |
22:51
🔗
|
|
garyrh gives channel operator status to midas sep332 |
22:52
🔗
|
|
zottelbey has quit (Remote host closed the connection) |
22:53
🔗
|
sep332 |
oh, interesting |
22:55
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
23:05
🔗
|
sep332 |
we're not really going to know how big a shard is ahead of time, huh :-/ |
23:30
🔗
|
SketchCow |
sep332 |
23:31
🔗
|
SketchCow |
could you compare md5 and sha1 of a file? |
23:31
🔗
|
SketchCow |
that wjll show trje dypes |
23:31
🔗
|
SketchCow |
that will show true dupes. |
23:31
🔗
|
SketchCow |
then we should make a csv |
23:34
🔗
|
sep332 |
i don't have the actual files. do you have sha-1's of the files |
23:34
🔗
|
sep332 |
? |
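SketchCow's suggestion (match on both md5 and sha1, then make a csv) can be sketched against metadata alone, which matters because sep332 does not have the files themselves. The example below assumes each census record is a dict with `item`, `name`, `md5` and `sha1` keys; those field names and the CSV column layout are hypothetical. Files are grouped only when both digests agree, so an MD5 collision between different content would not be reported as a dupe.

```python
import csv
import sys
from collections import defaultdict


def find_true_dupes(records):
    """Group files whose md5 AND sha1 both match; keep groups of 2 or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["md5"], rec["sha1"])].append((rec["item"], rec["name"]))
    return {key: files for key, files in groups.items() if len(files) > 1}


def write_dupes_csv(dupes, out):
    """Write one CSV row per duplicated file, sorted for stable output."""
    writer = csv.writer(out)
    writer.writerow(["md5", "sha1", "item", "name"])
    for (md5, sha1), files in sorted(dupes.items()):
        for item, name in sorted(files):
            writer.writerow([md5, sha1, item, name])


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"item": "a", "name": "x.pdf", "md5": "m1", "sha1": "s1"},
        {"item": "b", "name": "y.pdf", "md5": "m1", "sha1": "s1"},
        {"item": "c", "name": "z.pdf", "md5": "m1", "sha1": "s2"},
    ]
    write_dupes_csv(find_true_dupes(sample), sys.stdout)
```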
23:45
🔗
|
|
X-Scale has quit (Ping timeout: 240 seconds) |
23:51
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
23:51
🔗
|
|
svchfoo2 gives channel operator status to Start |