#internetarchive.bak 2015-03-13,Fri

Time Nickname Message
00:06 🔗 londoncal has quit (Remote host closed the connection)
00:31 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
00:37 🔗 Start has quit (Disconnected.)
00:47 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
01:22 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
01:27 🔗 Start has quit (Disconnected.)
01:42 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
01:43 🔗 svchfoo2 gives channel operator status to Start
02:01 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
02:04 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
03:21 🔗 SketchCow they should be in the json.
03:21 🔗 SketchCow they are not?
03:23 🔗 sep332 no, I don't think any of the files have a sha listed. only md5
03:33 🔗 SketchCow ooo
03:34 🔗 SketchCow ok. csv of just the 22 million then.
03:38 🔗 sep332 ok, i can get that tomorrow
03:51 🔗 SketchCow Thanks!
05:15 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
05:17 🔗 wp494 has quit (Ping timeout: 740 seconds)
05:38 🔗 wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
06:58 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
07:00 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
07:06 🔗 closure DFJustin: my current best guess is git-annex will be told the urls on archive.org and not see the file contents until they are downloaded onto individuals' drives, so no deduplication in that case
07:06 🔗 closure except whatever the sha1s tell us
07:08 🔗 closure sep332: a list of files in a form like this would let me start generating test git-annex repos: sha1 size collection item file url
07:08 🔗 closure (CSV or something)
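
A minimal sketch of the idea being described here: feeding git-annex archive.org URLs without downloading the content. The CSV column order, file layout, and repo name are assumptions for illustration, not the project's actual tooling.

    # Read rows of "sha1,size,collection,item,file,url" and register each url.
    git init ia-bak && cd ia-bak
    git annex init "IA.BAK test repo"
    while IFS=, read -r sha1 size collection item file url; do
        mkdir -p "$item"
        # --relaxed records the url as a source without fetching or checksumming it
        git annex addurl --relaxed --file="$item/$file" "$url"
    done < census-files.csv
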
07:16 🔗 db48x has quit (Read error: Operation timed out)
07:31 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
08:16 🔗 londoncal has quit (Quit: Leaving...)
09:14 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
09:15 🔗 thunk has quit (Client Quit)
09:56 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
09:59 🔗 bzc6p_ has quit (Ping timeout: 600 seconds)
10:25 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
12:40 🔗 VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
13:15 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
13:16 🔗 thunk has quit (Client Quit)
13:16 🔗 zottelbey has quit (Ping timeout: 512 seconds)
13:17 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
13:22 🔗 SketchCow I had a chat about that websrv-file by the way.
13:24 🔗 SketchCow It turns out they intentionally dupe websrv-XXXX files because they're critical indexes.
13:25 🔗 sep332 oh ok. well that's good
13:29 🔗 sep332 closure: some of these items have multiple collections listed.
13:29 🔗 SketchCow Right. Obviously, our backup might turn off half of them.
13:29 🔗 sep332 for example the first one in the file: item "Urdu-Trana-001" is in "iraq_middleeast","iraq_war" and "newsandpublicaffairs"
13:29 🔗 SketchCow but they'll all be websrv-XXXX-0 and -1
13:29 🔗 SketchCow If you have time, it'd be neat to see how much space that is (and if they match up)
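
Given a per-file listing with sizes, totalling the space taken by the websrv-* copies is a one-liner; the column positions below (size in field 2, file name in field 5) are assumptions about the CSV assembled later in the day, and whether the -0 and -1 copies match up could then be checked by comparing their hash column.

    # Total bytes used by websrv-* files, reported in GB (column layout assumed).
    awk -F, '$5 ~ /websrv-/ { total += $2 } END { printf "%.1f GB\n", total / 1e9 }' files.csv
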
13:34 🔗 sep332 (~sep332@[redacted]) has left #internetarchive.bak
13:34 🔗 sep332 (~sep332@[redacted]) has joined #internetarchive.bak
13:34 🔗 svchfoo2 gives channel operator status to sep332
13:39 🔗 closure sep332: multiple columns for anything except the sha1 will be fine
14:00 🔗 bzc6p__ is now known as bzc6p
14:06 🔗 sep332 i'm having trouble with jq syntax
14:07 🔗 sep332 i can get the attributes of each file on a line, or each item on a line
14:07 🔗 sep332 but not both
14:18 🔗 Start has quit (Disconnected.)
14:44 🔗 sep332 ok I got the syntax. it's just running very slowly
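
The flattening being described might look roughly like this in jq; the field names here (identifier, collection, files, name, md5, size) are guesses, since the census schema isn't shown in the log.

    # One output line per file, carrying the item-level fields along with it.
    jq -r '.identifier as $item
           | (.collection | if type == "array" then join(";") else . end) as $coll
           | .files[]
           | [ .md5, .size, $coll, $item, .name ]
           | @csv' census.json > files.csv
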
14:51 🔗 Laverne (~Laverne@[redacted]) has joined #internetarchive.bak
15:03 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
15:09 🔗 caber (~caber@[redacted]) has joined #internetarchive.bak
15:10 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
15:30 🔗 tephra (~tephra@[redacted]) has left #internetarchive.bak
15:39 🔗 sep332 geez, 120GB and still going
15:41 🔗 db48x what're you working on?
15:41 🔗 sep332 looking for duplicate files in the census
15:41 🔗 db48x :)
15:43 🔗 midas dupe items will be pretty large
15:47 🔗 sep332 it's funny, my intermediate files are repeating the item name and collections for each file, instead of once per item
15:47 🔗 sep332 so i'm creating a lot of redundant data in order to look for redundant data
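
With one line per file, the duplicate hunt itself can be a plain sort/uniq pass; this sketch assumes the md5 sits in the first comma-separated column of the flattened CSV.

    # Hashes that appear on more than one line.
    cut -d, -f1 files.csv | LC_ALL=C sort | uniq -d > duplicate-md5s.txt
    # Pull the full rows for those hashes back out.
    grep -F -f duplicate-md5s.txt files.csv > duplicate-rows.csv
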
15:51 🔗 Start has quit (Disconnected.)
15:57 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
15:58 🔗 Start has quit (Read error: Connection reset by peer)
15:58 🔗 johtso (uid563@[redacted]) has joined #internetarchive.bak
15:58 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
16:20 🔗 sep332 ah and my syntax was wrong again. that's why the files were so big.
16:20 🔗 sep332 that's good, i didn't want to run "sort" on 200GB files!
16:27 🔗 patrickod (~patrickod@[redacted]) has joined #internetarchive.bak
16:45 🔗 Start has quit (Disconnected.)
16:53 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
17:16 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
17:17 🔗 thunk has quit (Client Quit)
17:30 🔗 sep332 phew data extracted, sort commenced. 33.2GB this time, much better!
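
For a sort of this size, GNU sort's memory and scratch-space knobs are the main levers; the buffer size, temp directory, and key below are illustrative, not what was actually run.

    # Byte-wise collation, a big in-memory buffer, and temp files on a roomy disk.
    LC_ALL=C sort -S 8G -T /mnt/scratch --parallel=4 -t, -k1,1 files.csv > files.sorted.csv
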
17:34 🔗 Start has quit (Read error: Connection reset by peer)
17:34 🔗 Start_ (~Start@[redacted]) has joined #internetarchive.bak
17:35 🔗 Start_ is now known as Start
17:43 🔗 Start has quit (Disconnected.)
18:42 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:44 🔗 Start has quit (Read error: Connection reset by peer)
18:44 🔗 Start_ (~Start@[redacted]) has joined #internetarchive.bak
19:02 🔗 Start_ has quit (Disconnected.)
19:17 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:20 🔗 zottelbey has quit (Ping timeout: 512 seconds)
19:21 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
19:27 🔗 Start has quit (Disconnected.)
19:32 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:17 🔗 Start has quit (Disconnected.)
21:02 🔗 sep332 sort and uniq done! 32 million lines, 4.1 GB
21:02 🔗 sep332 note to self: never use pv with the -l flag, it's super-slow
21:04 🔗 yipdw for super long lines yeah
21:07 🔗 sep332 oh is that what it is? I didn't think these lines were that long but it was using 100% cpu for two hours :p
21:08 🔗 sep332 it averages 135 characters per line
21:11 🔗 yipdw oh hmm
21:11 🔗 yipdw odd, pv -l is usually fast enough for me
21:11 🔗 yipdw maybe we have different requirements
21:12 🔗 sep332 i want it to be transparent - if it's slower than doing actual work, it's too slow :)
21:13 🔗 sep332 it couldn't feel uniq fast enough. which is admittedly a tough job.
21:14 🔗 yipdw I often don't feel uniq
21:16 🔗 sep332 lol i meant feed
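
For context: pv's default display counts bytes passing through, while -l makes it count newlines, so it has to scan the data rather than just tally buffer sizes, which may or may not explain the slowdown seen here.

    LC_ALL=C sort files.csv | pv    | uniq > files.uniq.csv   # progress measured in bytes
    LC_ALL=C sort files.csv | pv -l | uniq > files.uniq.csv   # progress measured in lines
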
21:17 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
21:18 🔗 thunk has quit (Client Quit)
21:52 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
21:57 🔗 sep332 closure: https://www.dropbox.com/sh/u6trh0ldjuj3k3i/AAD4XDhJG6nhQBW11Q3Gy-FNa?dl=0
21:57 🔗 sep332 it's the big one lol
21:58 🔗 bzc6p has quit (Ping timeout: 600 seconds)
21:59 🔗 niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
22:00 🔗 sep332 afk for an hour but let me know how it looks
22:35 🔗 VADemon sep332, why did you use md5 and not sha1 (possible hash collisions)?
22:50 🔗 sep332 I don't have the sha1. the census file only has md5
22:51 🔗 sep332 anyway i doubt there are collisions unless someone was really trying to make a collision. md5 is only weak if you're attacking it
23:09 🔗 sep332 I ran the file through sort, but it came out scrambled anyway. i could not figure out what it sorted on
23:09 🔗 sep332 turns out it's a numeric sort of the filename. d'oh!
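
To make sort key on a particular column instead of whatever happens to sit at the start of each line, the field separator and key have to be spelled out; the column numbers follow the same assumed layout as above.

    LC_ALL=C sort -t, -k1,1 files.csv > sorted-by-md5.csv     # text sort on the md5 column
    LC_ALL=C sort -t, -k2,2n files.csv > sorted-by-size.csv   # numeric sort on the size column
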
23:24 🔗 X-Scale has quit (Ping timeout: 240 seconds)
23:33 🔗 niyaje has quit (Ping timeout: 600 seconds)
23:47 🔗 niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
