Time |
Nickname |
Message |
00:15
🔗
|
|
Start (~Start@[redacted]) has joined #internetarchive.bak |
00:16
🔗
|
|
svchfoo2 gives channel operator status to Start |
00:29
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
00:40
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
01:08
🔗
|
closure |
sep332: looks ok.. is there a way to construct an url from the fields there? |
01:08
🔗
|
closure |
also, I can't tell what if anything you've done to handle commas in filenames.. |
01:09
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
01:53
🔗
|
|
zottelbey has quit (Ping timeout: 512 seconds) |
02:06
🔗
|
sep332 |
the URL is just https://archive.org/download/[itemid]/[filename] |
02:07
🔗
|
sep332 |
would just putting quotes around the filenames help? |
02:08
🔗
|
Ctrl-S |
percent encode? |
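A minimal sketch of the percent-encoding idea raised here, written in Python rather than jq. It assumes the URL layout sep332 gives above (`https://archive.org/download/[itemid]/[filename]`); the helper name `download_url` and the sample item id and filename are made up for illustration.

```python
from urllib.parse import quote


def download_url(item_id: str, filename: str) -> str:
    """Build a download URL with the item id and filename percent-encoded."""
    # safe="/" keeps directory separators inside filenames readable;
    # commas, spaces, ampersands and the like are escaped as %XX.
    return "https://archive.org/download/{}/{}".format(
        quote(item_id, safe=""),
        quote(filename, safe="/"),
    )


# Hypothetical item id and filename, chosen to include a comma.
print(download_url("example-item", "a, b & c.mp3"))
# https://archive.org/download/example-item/a%2C%20b%20%26%20c.mp3
```

Encoding the comma as `%2C` sidesteps the quoting question for comma-separated output, since an encoded URL never contains a literal comma.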
02:10
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
02:11
🔗
|
sep332 |
that would be nicer but I don't know how to do that yet |
02:23
🔗
|
Ctrl-S |
wut language? |
02:24
🔗
|
sep332 |
jq |
02:25
🔗
|
sep332 |
http://stedolan.github.io/jq/manual/#Formatstringsandescaping |
02:25
🔗
|
sep332 |
looks promising |
02:25
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
02:29
🔗
|
|
marvinw has quit (Read error: Operation timed out) |
02:37
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
02:38
🔗
|
|
phuzion has quit (Read error: Operation timed out) |
02:41
🔗
|
|
marvinw (~marvinw@[redacted]) has joined #internetarchive.bak |
02:46
🔗
|
|
johtso has quit (Quit: Connection closed for inactivity) |
02:50
🔗
|
|
niyaje has quit (Ping timeout: 600 seconds) |
03:21
🔗
|
|
phuzion (~phuzion@[redacted]) has joined #internetarchive.bak |
03:50
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
04:42
🔗
|
|
db48x has quit (Read error: Connection reset by peer) |
04:44
🔗
|
espes___ |
I'm kinda impressed jq is a 'language' |
05:22
🔗
|
SketchCow |
sep332: What are the conclusions? How much space in duplicates? |
07:49
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
07:50
🔗
|
|
thunk has quit (Client Quit) |
08:27
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
09:02
🔗
|
|
GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak |
09:10
🔗
|
|
serapeum (~serapeum@[redacted]) has joined #internetarchive.bak |
09:47
🔗
|
|
bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak |
09:47
🔗
|
|
mhazinsk has quit (hub.efnet.us irc.umich.edu) |
09:47
🔗
|
|
trs80 has quit (hub.efnet.us irc.umich.edu) |
09:52
🔗
|
|
bzc6p_ has quit (Ping timeout: 600 seconds) |
09:53
🔗
|
|
Laverne has quit (Read error: Operation timed out) |
09:53
🔗
|
|
chazchaz has quit (Read error: Operation timed out) |
09:53
🔗
|
|
svchfoo1 has quit (Read error: Operation timed out) |
09:53
🔗
|
|
aschmitz has quit (Read error: Operation timed out) |
09:54
🔗
|
|
shabble has quit (Read error: Operation timed out) |
09:54
🔗
|
|
shabble (~shabble@[redacted]) has joined #internetarchive.bak |
09:55
🔗
|
|
aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak |
09:59
🔗
|
|
swebb has quit (Ping timeout: 369 seconds) |
10:00
🔗
|
|
chazchaz (~chazchaz@[redacted]) has joined #internetarchive.bak |
10:00
🔗
|
|
Laverne (~Laverne@[redacted]) has joined #internetarchive.bak |
10:00
🔗
|
|
swebb (~swebb@[redacted]) has joined #internetarchive.bak |
10:02
🔗
|
|
svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak |
10:02
🔗
|
|
phuzion has quit (Read error: Operation timed out) |
10:02
🔗
|
|
sep332 has quit (Read error: Operation timed out) |
10:02
🔗
|
|
achip has quit (Read error: Operation timed out) |
10:02
🔗
|
|
svchfoo2 gives channel operator status to svchfoo1 |
10:03
🔗
|
|
londonca_ (~londoncal@[redacted]) has joined #internetarchive.bak |
10:03
🔗
|
|
rossdylan has quit (Read error: Operation timed out) |
10:03
🔗
|
|
dirt has quit (Read error: Operation timed out) |
10:03
🔗
|
|
acridAxid has quit (Read error: Operation timed out) |
10:04
🔗
|
|
acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak |
10:04
🔗
|
|
svchfoo1 gives channel operator status to acridAxid |
10:04
🔗
|
|
GLaDOS has quit (Read error: Operation timed out) |
10:05
🔗
|
|
GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak |
10:06
🔗
|
|
hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak |
10:08
🔗
|
|
londoncal has quit (Read error: Operation timed out) |
10:09
🔗
|
|
hatsefla1 has quit (Read error: Operation timed out) |
10:10
🔗
|
|
wp494 has quit (Quit: LOUD UNNECESSARY QUIT MESSAGES) |
10:10
🔗
|
|
Ctrl-S has quit (Read error: Operation timed out) |
10:10
🔗
|
|
bzc6p__ has quit (Read error: Operation timed out) |
10:13
🔗
|
|
sep332 (~sep332@[redacted]) has joined #internetarchive.bak |
10:13
🔗
|
|
dirt (james@[redacted]) has joined #internetarchive.bak |
10:13
🔗
|
|
achip (~thechip@[redacted]) has joined #internetarchive.bak |
10:13
🔗
|
|
svchfoo1 gives channel operator status to sep332 |
10:13
🔗
|
|
bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak |
10:15
🔗
|
|
phuzion (~phuzion@[redacted]) has joined #internetarchive.bak |
10:16
🔗
|
|
acridAxid has quit (Quit: Quitting) |
10:17
🔗
|
|
Ctrl-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak |
10:17
🔗
|
|
svchfoo1 gives channel operator status to Ctrl-S |
10:18
🔗
|
|
wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak |
10:18
🔗
|
|
acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak |
10:19
🔗
|
|
svchfoo2 gives channel operator status to acridAxid |
10:19
🔗
|
|
rossdylan (~rossdylan@[redacted]) has joined #internetarchive.bak |
10:36
🔗
|
|
Kenshin has quit (Ping timeout: 258 seconds) |
10:37
🔗
|
|
irc.umich.edu gives channel operator status to trs80 mhazinsk |
10:37
🔗
|
|
mhazinsk (~matt@[redacted]) has joined #internetarchive.bak |
10:37
🔗
|
|
trs80 (trs80@[redacted]) has joined #internetarchive.bak |
10:53
🔗
|
|
Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak |
10:54
🔗
|
|
svchfoo1 gives channel operator status to Kenshin |
12:06
🔗
|
|
trs80 has quit (Ping timeout: 186 seconds) |
12:08
🔗
|
|
trs80 (~trs80@[redacted]) has joined #internetarchive.bak |
12:11
🔗
|
|
zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak |
12:41
🔗
|
|
X-Scale has quit (Ping timeout: 240 seconds) |
12:45
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
13:49
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
14:17
🔗
|
sep332 |
SketchCow: there are 659 TB of files listed. So more than half of that is duplicated. |
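The log does not show how this figure was computed. One plausible way, sketched here in Python, is to treat files with the same md5 as copies of each other and count every copy after the first as duplicated space. The function name `duplicate_bytes` and the sample records are hypothetical.

```python
def duplicate_bytes(records):
    """Return (total_bytes, duplicated_bytes) for an iterable of (md5, size).

    The first file seen with a given md5 counts as the original;
    every later file with the same md5 counts as duplicated space.
    """
    seen = set()
    total = 0
    duplicated = 0
    for md5, size in records:
        total += size
        if md5 in seen:
            duplicated += size
        else:
            seen.add(md5)
    return total, duplicated


# Made-up records: "aa" appears three times, so two copies are redundant.
sample = [("aa", 100), ("bb", 50), ("aa", 100), ("aa", 100)]
print(duplicate_bytes(sample))  # (350, 200)
```

This keeps one md5 string per distinct file in memory, which may be too much for the full census; sorting the records by md5 first and comparing neighbours would be an alternative.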
14:57
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
15:20
🔗
|
|
destrudo has quit (Remote host closed the connection) |
15:31
🔗
|
|
destrudo (~destrudo@[redacted]) has joined #internetarchive.bak |
15:35
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
15:39
🔗
|
|
johtso (uid563@[redacted]) has joined #internetarchive.bak |
16:10
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
17:00
🔗
|
|
garyrh has quit (http://bnc4free.com/) |
17:20
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
18:04
🔗
|
|
londonca_ has quit (Quit: Leaving...) |
18:05
🔗
|
|
londoncal (~londoncal@[redacted]) has joined #internetarchive.bak |
19:02
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
19:49
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
20:54
🔗
|
|
db48x (~user@[redacted]) has joined #internetarchive.bak |
20:55
🔗
|
|
garyrh_ (~Mithrandi@[redacted]) has joined #internetarchive.bak |
21:30
🔗
|
|
DFJustin has quit (Ping timeout: 740 seconds) |
21:42
🔗
|
|
thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client) |
21:43
🔗
|
closure |
<SketchCow> Anyway, tracey says use md5sum AND sha-1 |
21:44
🔗
|
closure |
still only md5sum in the current census |
21:44
🔗
|
closure |
getting sha1s into it would be really helpful. |
21:45
🔗
|
closure |
git-annex has only supported using md5 since february |
22:00
🔗
|
closure |
jq is what the IA guys are using for this dataset. It's pretty rad! |
22:00
🔗
|
closure |
zcat public-file-size-md_20150304205357.json.gz | ./jq '.files[] | .name, .size' |head |
22:00
🔗
|
closure |
"Sabeeluna-al-jeehed.mp3" |
22:00
🔗
|
closure |
3227238 |
22:00
🔗
|
closure |
"Dilon_kay_Hukmaran.mp3" |
22:00
🔗
|
closure |
3737995 |
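For readers without jq, a rough Python equivalent of the `.files[] | .name, .size` filter above. It assumes the census file holds one JSON object per line, which the log does not confirm; the demo builds a tiny gzip stream in memory instead of reading the real census file.

```python
import gzip
import io
import json


def iter_name_size(lines):
    """Yield (name, size) for every file entry in a stream of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        for entry in item.get("files", []):
            yield entry.get("name"), entry.get("size")


# In-memory stand-in for the census file, using the names and sizes
# shown in the jq output above.
raw = json.dumps({"files": [
    {"name": "Sabeeluna-al-jeehed.mp3", "size": 3227238},
    {"name": "Dilon_kay_Hukmaran.mp3", "size": 3737995},
]}) + "\n"
buf = io.BytesIO(gzip.compress(raw.encode("utf-8")))

with gzip.open(buf, "rt", encoding="utf-8") as stream:
    for name, size in iter_name_size(stream):
        print(name, size)
```

Passing a real path to `gzip.open` in place of `buf` would stream the file line by line without decompressing it to disk first.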
22:01
🔗
|
closure |
sep332: I think I'll go with this, makes it easy to parse and deal with any changes in the census |
22:07
🔗
|
closure |
jq --raw-output '.collection , .id , (.files[] | .name, .size)' |
22:11
🔗
|
|
bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak |
22:11
🔗
|
closure |
jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , .size)' |
22:14
🔗
|
closure |
jq --raw-output '.collection , "i:" + .id , (.files[] | "f:" + .name , "s:" + (.size | tostring))' |
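A sketch of how the tagged output from that last command could be read back, assuming one value per line with `i:` for the item id, `f:` for a file name, `s:` for its size, and any untagged line being the collection. The parser name `parse_tagged` and the sample values are hypothetical, and a collection name that happens to start with one of the tags would be misread.

```python
def parse_tagged(lines):
    """Group tagged lines into a list of items with their files."""
    items = []
    collection = None
    current = None
    pending_name = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("i:"):
            # A new item starts; it belongs to the last collection seen.
            current = {"collection": collection, "id": line[2:], "files": []}
            items.append(current)
        elif line.startswith("f:"):
            pending_name = line[2:]
        elif line.startswith("s:"):
            text = line[2:]
            # A missing size would be stringified as "null" by jq's tostring.
            size = int(text) if text.isdigit() else None
            if current is not None and pending_name is not None:
                current["files"].append((pending_name, size))
            pending_name = None
        else:
            collection = line
    return items


sample = [
    "example-collection",
    "i:example-item",
    "f:a.mp3",
    "s:3227238",
    "f:b.mp3",
    "s:null",
]
print(parse_tagged(sample))
```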
22:18
🔗
|
|
bzc6p has quit (Read error: Operation timed out) |
22:44
🔗
|
|
thunk (4746deec@[redacted]) has joined #internetarchive.bak |
22:47
🔗
|
closure |
jq --raw-output '.id as $foo | .files[] | @uri "https://archive.org/\($foo)/\(.name)"' <-- this is pretty awesome, it's a list of urls! |
22:52
🔗
|
db48x |
:) |
22:56
🔗
|
closure |
kinda weird this 1st item in the census is 404 https://archive.org/details/Urdu-Trana-001 |
22:57
🔗
|
|
rossdylan has quit (Read error: Operation timed out) |
23:20
🔗
|
closure |
argh, this dataset sometimes has id: "string" and sometimes the id is an array. makes it hard to get urls that always work |
23:43
🔗
|
closure |
jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\($coll)\thttps://archive.org/\($id)/\(.name)"' |
23:44
🔗
|
closure |
yay, that works for both arrays and not-arrays! |
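The same string-or-array handling, sketched in Python: wrap a bare value in a list so both shapes can be looped over the same way. This only roughly mirrors the jq `(.id[]? // .id)` idiom. The URL uses the `/download/` form sep332 gave earlier rather than the path in the jq command, and the function names and sample item are made up for illustration.

```python
from urllib.parse import quote


def as_list(value):
    """Return value unchanged if it is a list, else wrap it in one."""
    return value if isinstance(value, list) else [value]


def collection_urls(item):
    """Yield 'collection<TAB>url' lines for every collection/id/file combo."""
    for coll in as_list(item.get("collection")):
        for item_id in as_list(item.get("id")):
            for entry in item.get("files", []):
                url = "https://archive.org/download/{}/{}".format(
                    quote(str(item_id), safe=""),
                    quote(str(entry["name"]), safe="/"),
                )
                yield "{}\t{}".format(coll, url)


# Hypothetical item: the collection is an array, the id is a plain string.
item = {
    "collection": ["coll-a", "coll-b"],
    "id": "example-item",
    "files": [{"name": "a b.mp3"}],
}
for row in collection_urls(item):
    print(row)
```

As with the jq version, an item listed in two collections produces one line per collection, so the same URL can appear more than once.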
23:45
🔗
|
closure |
jq --raw-output '(.collection[]? // .collection) as $coll | (.id[]? // .id) as $id | .files[] | @uri "\(.md5)\t\($coll)\thttps://archive.org/\($id)/\(.name)"' |
23:46
🔗
|
|
closure is running that now on the full census to get a md5_collection_url.txt |
23:46
🔗
|
closure |
already 3 million lines in the file, this is fast! |