00:07 <londoncal> this project is a-go?
01:42 <closure> sep332: oh, I didn't see you were already using jq .. what command lines did you come up with?
01:49 -- zottelbey has quit (Remote host closed the connection)
02:14 -- londoncal has quit (Leaving...)
02:43 -- niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
03:16 -- DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
03:16 -- svchfoo1 gives channel operator status to DFJustin
03:31 -- thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
04:14 <sep332> really? I spent like an hour getting this thing to output csv!
04:15 <sep332> jq -r '"\(.files[] | ["\"", .md5, "\"", ",", .size, ",", "\"", .name, "\"" | tostring] | add),\"\(.id)\",\(.collection)"'
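The jq program above hand-assembles the quote and comma characters itself; jq's built-in `@csv` filter (or any CSV library) does the escaping for you. A rough Python equivalent of the same JSON-to-CSV step, assuming each census item is shaped the way the jq program implies — a `files` list with `md5`/`size`/`name`, plus top-level `id` and `collection` (the sample item below is invented):

```python
import csv
import io
import json

# Invented sample item, shaped the way the jq program above implies:
# .files[] entries with md5/size/name, plus top-level .id and .collection.
item = json.loads('''{
  "id": "example_item",
  "collection": "test_collection",
  "files": [{"md5": "d41d8cd98f00b204e9800998ecf8427e",
             "size": 0, "name": "empty.txt"}]
}''')

out = io.StringIO()
writer = csv.writer(out)  # quotes fields only when needed, escapes embedded quotes
for f in item["files"]:
    writer.writerow([f["md5"], f["size"], f["name"], item["id"], item["collection"]])

print(out.getvalue().strip())
# -> d41d8cd98f00b204e9800998ecf8427e,0,empty.txt,example_item,test_collection
```

Unlike string concatenation, `csv.writer` also stays correct when a filename contains a comma or a quote character.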
04:18 <closure> yeah, the syntax, it is insane.
04:18 <sep332> it's probably fine if you're doing JSON->JSON
04:19 <sep332> anyway i'm glad you got it working, it's probably faster on your box than my old AMD server anyway
04:20 <closure> your dropbox file census_dupes.csv seems smaller than I expected
04:23 <closure> 32 million files
04:23 <sep332> i did get two different numbers somehow, but both right around 34 million
04:24 <closure> i'm getting 177 million
04:25 <closure> have not dedupped by md5 yet, but it can't be that many
04:27 <sep332> but there are only 271 million files in the whole census. you're saying more than half of the archive is dupes?
04:28 <closure> I think the 271 number was from before they hid the dark and non-downloadable items
04:28 <closure> 177 seems to be the count in the current census
04:29 <sep332> oh i don't even have that
04:31 -- niyaje has quit (Read error: Operation timed out)
04:34 <sep332> anyway i'm going to bed, it's past midnight here
06:27 -- niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
07:31 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
07:32 -- thunk has quit (Client Quit)
07:51 -- X-Scale has quit (Remote host closed the connection)
08:21 -- niyaje has quit (Ping timeout: 600 seconds)
09:11 -- GLaDOS has quit (Read error: Operation timed out)
10:20 -- zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
10:20 -- bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
10:24 -- bzc6p_ has quit (Ping timeout: 606 seconds)
11:00 -- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
11:02 -- bzc6p__ has quit (Ping timeout: 240 seconds)
11:12 -- GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
11:22 -- londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
11:32 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
11:33 -- thunk has quit (Client Quit)
12:29 -- garyrh (garyrh@[redacted]) has joined #internetarchive.bak
12:29 -- londoncal has quit (Quit: Leaving...)
12:29 -- svchfoo1 gives channel operator status to garyrh
13:16 -- johtso has quit (Quit: Connection closed for inactivity)
13:40 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
14:03 -- thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
14:04 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
15:39 -- thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
16:33 <closure> 157103356 md5_collection_url.txt.pick1.sorted.uniq
16:33 <closure> 177739786 md5_collection_url.txt.pick1.sorted
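The two counts are the total versus the deduplicated row counts of the same list (177,739,786 lines before `uniq`, 157,103,356 after). The same total-vs-distinct comparison in Python, on an invented sample list:

```python
# Total vs. distinct counts, like `wc -l` on the .sorted and .sorted.uniq files.
rows = ["md5a", "md5b", "md5a", "md5c", "md5b", "md5a"]  # invented sample
total, distinct = len(rows), len(set(rows))
print(total, distinct)           # -> 6 3
print(total - distinct, "dupes")  # -> 3 dupes
```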
16:33 <closure> all right then
16:37 <sep332> a difference of 20 million - that means my 34 million is really 40 million? :)
16:38 <sep332> or should be
17:00 <closure> so I grepped out the prelinger collection
17:00 <closure> 30360 prelinger.collection.list
17:06 <closure> 151129 GratefulDead.collection.list
17:13 * closure looks for a collection likely to have 70k files
17:16 <closure> actually, a list of all collections and number of files would be nice
17:16 <closure> guess I can do that
17:20 <closure> cut -f 3 md5_collection_url.txt.pick1.sorted.uniq | sort | uniq -c | sort -rn
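That pipeline pulls out field 3 (the collection, given the md5/size/collection/url column order used in the perl one-liner later), counts the occurrences of each value, and sorts the tally descending. A Python sketch of the same per-collection count, on invented sample rows:

```python
from collections import Counter

# Tab-separated rows: md5, size, collection, url (sample rows are invented).
rows = [
    "md5a\t10\tuspto\thttps://archive.org/download/a/a.pdf",
    "md5b\t20\tuspto\thttps://archive.org/download/b/b.pdf",
    "md5c\t30\tprelinger\thttps://archive.org/download/c/c.mpeg",
]

# Equivalent of `cut -f 3 | sort | uniq -c | sort -rn`, in one pass.
counts = Counter(line.split("\t")[2] for line in rows)
for coll, n in counts.most_common():  # most_common() sorts by count, descending
    print(n, coll)
# -> 2 uspto
#    1 prelinger
```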
17:27 <closure> 13157326 opensource_audio
17:27 <closure> 14125353 playdrone-metadata
17:27 <closure> 20723660 uspto
17:27 <closure> 13124639 usfederalcourts
17:27 <closure> 9348116 millionbooks
17:27 <closure> 6891603 null
17:27 <closure> wow
17:27 <closure> such files
17:28 <closure> 616356 gutenberg
17:30 <closure> 61118 usenethistorical
17:30 <closure> aha, that's a nice one :)
17:38 <closure> (especially since I've wanted a git-annex repo of that for other reasons..)
17:45 * closure goes to hack a mass-ingest command into git-annex
17:53 <SketchCow> Morning.
17:53 <closure> hey!
17:55 <SketchCow> Glad to see you're cracking on it.
17:55 <SketchCow> Obviously, understanding what exactly should even BE backed up is a big deal.
17:55 <SketchCow> And it seems we're pretty aggressively getting a grip on the "unknowable" archive.org.
17:56 <SketchCow> An additional possibility - not doing the open upload areas, for now.
17:56 <SketchCow> So, like, "opensource"
17:56 <closure> so my plan for today is to make a git-annex repo containing all of the prelinger and usenethistorical collections (100k files)
17:59 <closure> as a test
18:17 <closure> perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); $file=$url; $file=~s/https:\/\/archive.org\/download\///; $file="$coll/$file"; print ("MD5-s".$size."--".$md5." ".$file)'
18:17 <SketchCow> Great.
18:18 <closure> | git-annex keyfile
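The perl one-liner above turns each tab-separated census row into a git-annex MD5-backend key (`MD5-s<size>--<md5>`) plus a `collection/item/file` path; `git-annex keyfile` is the mass-ingest command closure is hacking in, not a stock git-annex subcommand. The same row-to-key transform re-implemented in Python for illustration (the sample row is invented):

```python
PREFIX = "https://archive.org/download/"


def key_and_path(row: str):
    """One tab-separated census row -> (git-annex MD5 key, collection/item/file path)."""
    md5, size, coll, url = row.rstrip("\n").split("\t", 3)
    path = coll + "/" + url.removeprefix(PREFIX)  # str.removeprefix: Python 3.9+
    return f"MD5-s{size}--{md5}", path


key, path = key_and_path(
    "d41d8cd98f00b204e9800998ecf8427e\t0\tprelinger\t"
    "https://archive.org/download/some_item/some_file.mpeg"  # invented row
)
print(key)   # -> MD5-s0--d41d8cd98f00b204e9800998ecf8427e
print(path)  # -> prelinger/some_item/some_file.mpeg
```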
18:18 <closure> SketchCow: while you're here, could you install some stuff on FOS? I'm going to want to build custom git-annex there.
18:19 <closure> apt-get build-dep git-annex
18:20 <SketchCow> message me. I'm having issues adding them, want to make sure I'm doing it right, and not scrolling this channel.
18:22 <closure> so yeah, that took around 1 minute for 30k files
18:22 <closure> but I have to add the urls to them all still
18:29 * closure adds a git-annex registerurl for mass-url-adding
18:31 <SketchCow> OK, I am going to walk the SXSW floor. Hopefully I can redo the wiki pages.
18:32 <SketchCow> But closure, definitely keep going with it.
18:32 <SketchCow> So, swimming upstream through cranky academic letters I've been getting
18:32 <SketchCow> I'll send you those later, but one factoid would be nice to calculate (Obviously, we're GETTING this data in doing this test):
18:32 <SketchCow> Amount of time to download, amount to upload.
18:33 <SketchCow> So for a restore, how long that takes.
18:36 <closure> \o/ git-annex registerurl implemented
18:36 <closure> I love my haskell libs
18:43 <closure> time perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); print ("MD5-s".$size."--".$md5." ".$url)' <p.list | git -c annex.alwayscommit=false annex registerurl
18:43 <closure> real 2m41.658s
18:43 <closure> registerurl stdin ok
18:43 <closure> joey@darkstar:~/tmp/ia/repo1/prelinger/yellowstone_national_park>git annex whereis yellowstone_national_park.mpeg
18:43 <closure> 00000000-0000-0000-0000-000000000001 -- web
18:43 <closure> web: https://archive.org/download/yellowstone_national_park/yellowstone_national_park.mpeg
18:43 <closure> ok
18:43 <closure> whereis yellowstone_national_park.mpeg (1 copy)
18:44 <closure> so yeah, there's a repo, and it knows how to download from the IA!
18:46 * closure goes to get some other, non-prelinger dataset
18:46 <closure> and.. the .git directory for these 30k files is 38 mb
19:05 <closure> less space than du says the files (broken symlinks) take up
19:05 <closure> 279M usenethistorical/.git/objects
19:05 <closure> annexed files in working tree: 61118
19:38 <closure> size of annexed files in working tree: 695.28 gigabytes
19:39 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
20:11 -- thunk has quit (Client Quit)
20:22 <xmc> nice!
20:22 -- Now talking on #internetarchive.bak
20:22 -- Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
20:23 -- Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015
21:01 -- svchfoo1 gives channel operator status to chfoo
21:22 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
21:51 -- BEGIN LOGGING AT Sun Mar 15 16:22:47 2015
22:11 -- espes___ has quit (Remote host closed the connection)
22:12 -- johtso (uid563@[redacted]) has joined #internetarchive.bak
22:12 -- garyrh_ has quit (Quit: Leaving)
22:14 -- londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
22:24 -- thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
22:30 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak
22:36 -- bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
22:45 -- bzc6p_ has quit (Ping timeout: 600 seconds)
22:56 -- thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
22:56 -- thunk (4746deec@[redacted]) has joined #internetarchive.bak