Time |
Nickname |
Message |
00:06
🔗
|
|
londoncal has quit (Remote host closed the connection) |
00:31
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
00:37
🔗
|
|
Start has quit (Disconnected.) |
00:47
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
01:22
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
01:27
🔗
|
|
Start has quit (Disconnected.) |
01:42
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
01:43
🔗
|
|
svchfoo2 gives channel operator status to Start |
02:01
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
02:04
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
03:21
🔗
|
SketchCow |
they should be in the json. |
03:21
🔗
|
SketchCow |
they are not? |
03:23
🔗
|
sep332 |
no, I don't think any of the files have a sha listed. only md5 |
03:33
🔗
|
SketchCow |
ooo |
03:34
🔗
|
SketchCow |
ok. csv of just the 22 million then. |
03:38
🔗
|
sep332 |
ok, i can get that tomorrow |
03:51
🔗
|
SketchCow |
Thanks! |
05:15
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
05:17
🔗
|
|
wp494 has quit (Ping timeout: 740 seconds) |
05:38
🔗
|
|
wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak |
06:58
🔗
|
|
db48x (~user@[redacted]) has joined #internetarchive.bak |
07:00
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
07:06
🔗
|
closure |
DFJustin: my current best guess is git-annex will be told the urls on archive.org and not see the file contents until they are downloaded into individuals' drives, so no deduplication in that case |
07:06
🔗
|
closure |
except whatever the sha1s tell us |
07:08
🔗
|
closure |
sep332: a list of files in a form like this would let me start generating test git-annex repos: sha1 size collection item file url |
07:08
🔗
|
closure |
(CSV or something) |
07:16
🔗
|
|
db48x has quit (Read error: Operation timed out) |
07:31
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
08:16
🔗
|
|
londoncal has quit (Quit: Leaving...) |
09:14
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
09:15
🔗
|
|
thunk has quit (Client Quit) |
09:56
🔗
|
|
bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak |
09:59
🔗
|
|
bzc6p_ has quit (Ping timeout: 600 seconds) |
10:25
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
12:40
🔗
|
|
VADemon (~VADemon@[redacted]) has joined #internetarchive.bak |
13:15
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
13:16
🔗
|
|
thunk has quit (Client Quit) |
13:16
🔗
|
|
zottelbey has quit (Ping timeout: 512 seconds) |
13:17
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
13:22
🔗
|
SketchCow |
I had a chat about that websrv-file by the way. |
13:24
🔗
|
SketchCow |
It turns out they intentionally dupe websrv-XXXX files because they're critical indexes. |
13:25
🔗
|
sep332 |
oh ok. well that's good |
13:29
🔗
|
sep332 |
closure: some of these items have multiple collections listed. |
13:29
🔗
|
SketchCow |
Right. Obviously, our backup might turn off half of them. |
13:29
🔗
|
sep332 |
for example the first one in the file: item "Urdu-Trana-001" is in "iraq_middleeast","iraq_war" and "newsandpublicaffairs" |
13:29
🔗
|
SketchCow |
but they'll all be websrv-XXXX-0 and -1 |
13:29
🔗
|
SketchCow |
If you have time, it'd be neat to see how much space that is (and if they match up) |
13:34
🔗
|
|
sep332 (~sep332@[redacted]) has left #internetarchive.bak |
13:34
🔗
|
|
sep332 (~sep332@[redacted]) has joined #internetarchive.bak |
13:34
🔗
|
|
svchfoo2 gives channel operator status to sep332 |
13:39
🔗
|
closure |
sep332: multiple columns for anything except the sha1 will be fine |
14:00
🔗
|
|
bzc6p__ is now known as bzc6p |
14:06
🔗
|
sep332 |
i'm having trouble with jq syntax |
14:07
🔗
|
sep332 |
i can get the attributes of each file on a line, or each item on a line |
14:07
🔗
|
sep332 |
but not both |
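A rough sketch of the flattening sep332 describes here: one output line per file, with the item-level fields repeated on each line. It is written in Python rather than jq, and the JSON layout is an assumption. The `id`, `collection`, `files`, `name`, `md5`, and `size` keys, the sample file names, and the hash placeholders are hypothetical, since the log never shows the census format. The url column closure asked for is left out for the same reason.

```python
import csv
import io
import json


def rows_for_item(item):
    """Yield one row per file, repeating the item-level fields on every row."""
    # Multiple collections are joined into one field (hypothetical choice).
    collections = ";".join(item.get("collection", []))
    for f in item.get("files", []):
        yield [f.get("md5", ""), f.get("size", ""), collections, item["id"], f["name"]]


def flatten(lines, out):
    """Read one JSON item per input line and write one CSV row per file."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if line:
            writer.writerows(rows_for_item(json.loads(line)))


# Hypothetical sample item; only the item and collection names come from the log.
sample = [json.dumps({
    "id": "Urdu-Trana-001",
    "collection": ["iraq_middleeast", "iraq_war", "newsandpublicaffairs"],
    "files": [
        {"name": "a.mp3", "md5": "aa", "size": 10},
        {"name": "b.mp3", "md5": "bb", "size": 20},
    ],
})]

buf = io.StringIO()
flatten(sample, buf)
print(buf.getvalue(), end="")
```

Repeating the item name and collections on every file row is what makes the intermediate output large, but it keeps each line self-contained for a later sort.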
14:18
🔗
|
|
Start has quit (Disconnected.) |
14:44
🔗
|
sep332 |
ok I got the syntax. it's just running very slowly |
14:51
🔗
|
|
Laverne (~Laverne@[redacted]) has joined #internetarchive.bak |
15:03
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
15:09
🔗
|
|
caber (~caber@[redacted]) has joined #internetarchive.bak |
15:10
🔗
|
|
db48x (~user@[redacted]) has joined #internetarchive.bak |
15:30
🔗
|
|
tephra (~tephra@[redacted]) has left #internetarchive.bak |
15:39
🔗
|
sep332 |
geez, 120GB and still going |
15:41
🔗
|
db48x |
what're you working on? |
15:41
🔗
|
sep332 |
looking for duplicate files in the census |
15:41
🔗
|
db48x |
:) |
15:43
🔗
|
midas |
dupe items will be pretty large |
15:47
🔗
|
sep332 |
it's funny, my intermediate files are repeating the item name and collections for each file, instead of once per item |
15:47
🔗
|
sep332 |
so i've created a lot of redundant data in order to look for redundant data |
15:51
🔗
|
|
Start has quit (Disconnected.) |
15:57
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
15:58
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
15:58
🔗
|
|
johtso (uid563@[redacted]) has joined #internetarchive.bak |
15:58
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
16:20
🔗
|
sep332 |
ah and my syntax was wrong again. that's why the files were so big. |
16:20
🔗
|
sep332 |
that's good, i didn't want to run "sort" on 200GB files! |
16:27
🔗
|
|
patrickod (~patrickod@[redacted]) has joined #internetarchive.bak |
16:45
🔗
|
|
Start has quit (Disconnected.) |
16:53
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
17:16
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
17:17
🔗
|
|
thunk has quit (Client Quit) |
17:30
🔗
|
sep332 |
phew data extracted, sort commenced. 33.2GB this time, much better! |
17:34
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
17:34
🔗
|
|
Start_ (~Start@[redacted]) has joined #internetarchive.bak |
17:35
🔗
|
|
Start_ is now known as Start |
17:43
🔗
|
|
Start has quit (Disconnected.) |
18:42
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
18:44
🔗
|
|
Start has quit (Read error: Connection reset by peer) |
18:44
🔗
|
|
Start_ (~Start@[redacted]) has joined #internetarchive.bak |
19:02
🔗
|
|
Start_ has quit (Disconnected.) |
19:17
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
19:20
🔗
|
|
zottelbey has quit (Ping timeout: 512 seconds) |
19:21
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
19:27
🔗
|
|
Start has quit (Disconnected.) |
19:32
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
20:17
🔗
|
|
Start has quit (Disconnected.) |
21:02
🔗
|
sep332 |
sort and uniq done! 32 million lines, 4.1 GB |
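For illustration, the duplicate search can be sketched as grouping rows by hash and keeping the groups with more than one file. This is an in-memory Python stand-in for the sort and uniq pipeline, not the commands sep332 ran; the `(md5, size, path)` row shape and the sample values are assumptions.

```python
from collections import defaultdict


def duplicate_groups(rows):
    """Group (md5, size, path) rows by md5 and keep hashes seen more than once."""
    by_hash = defaultdict(list)
    for md5, size, path in rows:
        by_hash[md5].append((int(size), path))
    return {h: files for h, files in by_hash.items() if len(files) > 1}


def wasted_bytes(groups):
    """Sum the sizes of every copy beyond the first in each duplicate group."""
    return sum(size for files in groups.values() for size, _ in files[1:])


# Hypothetical rows: two copies of one file and one unique file.
rows = [
    ("h1", "100", "item-a/file.bin"),
    ("h1", "100", "item-b/file.bin"),
    ("h2", "5", "item-c/other.bin"),
]

groups = duplicate_groups(rows)
print(groups)
print(wasted_bytes(groups))
```

A dictionary of every row would need a lot of memory at tens of millions of lines, which is one reason an external sort followed by a streaming comparison of adjacent lines is the more practical choice at that scale.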
21:02
🔗
|
sep332 |
note to self: never use pv with the -l flag, it's super-slow |
21:04
🔗
|
yipdw |
for super long lines yeah |
21:07
🔗
|
sep332 |
oh is that what it is? I didn't think these lines were that long but it was using 100% cpu for two hours :p |
21:08
🔗
|
sep332 |
it averages 135 characters per line |
21:11
🔗
|
yipdw |
oh hmm |
21:11
🔗
|
yipdw |
odd, pv -l is usually fast enough for me |
21:11
🔗
|
yipdw |
maybe we have different requirements |
21:12
🔗
|
sep332 |
i want it to be transparent - if it's slower than doing actual work, it's too slow :) |
21:13
🔗
|
sep332 |
it couldn't feel uniq fast enough. which is admittedly a tough job. |
21:14
🔗
|
yipdw |
I often don't feel uniq |
21:16
🔗
|
sep332 |
lol i meant feed |
21:17
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
21:18
🔗
|
|
thunk has quit (Client Quit) |
21:52
🔗
|
|
bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak |
21:57
🔗
|
sep332 |
closure: https://www.dropbox.com/sh/u6trh0ldjuj3k3i/AAD4XDhJG6nhQBW11Q3Gy-FNa?dl=0 |
21:57
🔗
|
sep332 |
it's the big one lol |
21:58
🔗
|
|
bzc6p has quit (Ping timeout: 600 seconds) |
21:59
🔗
|
|
niyaje (~niyaje@[redacted]) has joined #internetarchive.bak |
22:00
🔗
|
sep332 |
afk for an hour but let me know how it looks |
22:35
🔗
|
VADemon |
sep332, why did you use md5 and not sha1 (possible hash collisions)? |
22:50
🔗
|
sep332 |
I don't have the sha1. the census file only has md5 |
22:51
🔗
|
sep332 |
anyway i doubt there are collisions unless someone was really trying to make a collision. md5 is only weak if you're attacking it |
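A back-of-the-envelope check of that claim, using the birthday approximation p ≈ n(n-1)/2 / 2^bits, which holds when p is small. It assumes md5 output behaves like a uniform 128-bit value on non-adversarial input, and it takes the 32 million line count mentioned earlier as the number of files, which is an assumption about what those lines represent.

```python
def birthday_collision_probability(n, bits):
    """Approximate chance that any two of n random `bits`-bit hashes collide."""
    pairs = n * (n - 1) // 2  # number of distinct pairs of files
    return pairs / 2 ** bits


# 32 million files (assumed from the line count above), 128-bit md5.
p = birthday_collision_probability(32_000_000, 128)
print(f"{p:.2e}")
```

The result is on the order of 1e-24, so an accidental collision is not a practical concern; deliberately crafted collisions are a separate matter, as sep332 says.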
23:09
🔗
|
sep332 |
I ran the file through sort, but it came out scrambled anyway. i could not figure out what it sorted on |
23:09
🔗
|
sep332 |
turns out it's a numeric sort of the filename. d'oh! |
23:24
🔗
|
|
X-Scale has quit (Ping timeout: 240 seconds) |
23:33
🔗
|
|
niyaje has quit (Ping timeout: 600 seconds) |
23:47
🔗
|
|
niyaje (~niyaje@[redacted]) has joined #internetarchive.bak |