[00:07] this project is a-go?
[01:42] sep332: oh, I didn't see you were already using jq .. what command lines did you come up with?
[01:49] *** zottelbey has quit (Remote host closed the connection)
[02:14] *** londoncal has quit (Leaving...)
[02:43] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[03:16] *** DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
[03:16] *** svchfoo1 gives channel operator status to DFJustin
[03:31] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[04:14] really? I spent like an hour getting this thing to output csv!
[04:15] jq -r '"\(.files[] | ["\"", .md5, "\"", ",", .size, ",", "\"", .name, "\"" | tostring] | add),\"\(.id)\",\(.collection)"'
[04:18] yeah, the syntax, it is insane.
[04:18] it's probably fine if you're doing JSON->JSON
[04:19] anyway i'm glad you got it working, it's probably faster on your box than my old AMD server anyway
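[Editor's note: the jq one-liner above builds the CSV quoting by hand, which is most of its pain; jq's built-in @csv filter does the escaping itself. A minimal equivalent sketch, assuming each census record has the shape the command above implies ({"id": ..., "collection": ..., "files": [{"md5": ..., "size": ..., "name": ...}]}) and lives in a hypothetical census.json:

    jq -r '. as $item                # keep the item-level fields in scope
           | .files[]               # one CSV row per file
           | [.md5, .size, .name, $item.id, $item.collection]
           | @csv' census.json

@csv quotes strings and leaves numbers bare, so the output comes out close to the hand-rolled "md5",size,"name","id",collection layout.]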
[04:20] your dropbox file census_dupes.csv seems smaller than I expected
[04:23] 32 million files
[04:23] i did get two different numbers somehow, but both right around 34 million
[04:24] i'm getting 177 million
[04:25] have not deduped by md5 yet, but it can't be that many
[04:27] but there are only 271 million files in the whole census. you're saying more than half of the archive is dupes?
[04:28] I think the 271 number was before they hid the dark and non-downloadable items
[04:28] 177 seems to be the count in the current census
[04:29] oh i don't even have that
[04:31] *** niyaje has quit (Read error: Operation timed out)
[04:34] anyway i'm going to bed, it's past midnight here
[06:27] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[07:31] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[07:32] *** thunk has quit (Client Quit)
[07:51] *** X-Scale has quit (Remote host closed the connection)
[08:21] *** niyaje has quit (Ping timeout: 600 seconds)
[09:11] *** GLaDOS has quit (Read error: Operation timed out)
[10:20] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[10:20] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[10:24] *** bzc6p_ has quit (Ping timeout: 606 seconds)
[11:00] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[11:02] *** bzc6p__ has quit (Ping timeout: 240 seconds)
[11:12] *** GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
[11:22] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[11:32] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[11:33] *** thunk has quit (Client Quit)
[12:29] *** garyrh (garyrh@[redacted]) has joined #internetarchive.bak
[12:29] *** londoncal has quit (Quit: Leaving...)
[12:29] *** svchfoo1 gives channel operator status to garyrh
[13:16] *** johtso has quit (Quit: Connection closed for inactivity)
[13:40] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[14:03] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[14:04] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[15:39] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[16:33] 157103356 md5_collection_url.txt.pick1.sorted.uniq
[16:33] 177739786 md5_collection_url.txt.pick1.sorted
[16:33] all right then
[16:37] a difference of 20 million - that means my 34 million is really 40 million? :)
[16:38] or should be
[17:00] so I grepped out the prelinger collection
[17:00] 30360 prelinger.collection.list
[17:06] 151129 GratefulDead.collection.list
[17:13] *** closure looks for a collection likely to have 70k files
[17:16] actually, a list of all collections and number of files would be nice
[17:16] guess I can do that
[17:20] cut -f 3 md5_collection_url.txt.pick1.sorted.uniq | sort | uniq -c | sort -rn
[17:27] 13157326 opensource_audio
[17:27] 14125353 playdrone-metadata
[17:27] 20723660 uspto
[17:27] 13124639 usfederalcourts
[17:27] 9348116 millionbooks
[17:27] 6891603 null
[17:27] wow
[17:27] such files
[17:28] 616356 gutenberg
[17:30] 61118 usenethistorical
[17:30] aha, that's a nice one :)
[17:38] (especially since I've wanted a git-annex repo of that for other reasons..)
[17:45] *** closure goes to hack a mass-ingest command into git-annex
[17:53] Morning.
[17:53] hey!
[17:55] Glad to see you're racking on it.
[17:55] Obviously, understanding what exactly should even BE backed up is a big deal.
[17:55] And it seems we're pretty aggressively getting a grip on the "unknowable" archive.org.
[17:56] An additional possibility - not doing the open upload areas, for now.
[17:56] So, like, "opensource"
[17:56] so my plan for today is to make a git-annex repo containing all of the prelinger and usenethistorical collections (100k files)
[17:59] as a test
[18:17] perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); $file=$url; $file=~s/https:\/\/archive.org\/download\///; $file="$coll/$file"; print ("MD5-s".$size."--".$md5." ".$file)'
[18:17] Great.
[18:18] | git-annex keyfile
[18:18] SketchCow: while you're here, could you install some stuff on FOS? I'm going to want to build custom git-annex there.
[18:19] apt-get build-dep git-annex
[18:20] message me. I'm having issues adding them, want to make sure I'm doing it right, and not scrolling this channel.
[18:22] so yeah, that took around 1 minute for 30k files
[18:22] but I have to add the urls to them all still
[18:29] *** closure adds a git-annex registerurl for mass-url-adding
[18:31] OK, I am going to walk the SXSW floor. Hopefully I can redo the wiki pages.
[18:32] But closure, definitely keep going with it.
[18:32] So, swimming upstream through cranky academic letters I've been getting
[18:32] I'll send you those later, but one factoid would be nice to calculate (Obviously, we're GETTING this data in doing this test):
[18:32] Amount of time to download, amount to upload.
[18:33] So for a restore, how long that takes.
[18:36] \o/ git-annex registerurl implemented
[18:36] I love my haskell libs
[18:43] time perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); print ("MD5-s".$size."--".$md5." ".$url)' real 2m41.658s
[18:43] registerurl stdin ok
[18:43] joey@darkstar:~/tmp/ia/repo1/prelinger/yellowstone_national_park>git annex whereis yellowstone_national_park.mpeg
[18:43] 00000000-0000-0000-0000-000000000001 -- web
[18:43] web: https://archive.org/download/yellowstone_national_park/yellowstone_national_park.mpeg
[18:43] ok
[18:43] whereis yellowstone_national_park.mpeg (1 copy)
[18:43] so yeah, there's a repo, and it knows how to download from the IA!
[18:44] *** closure goes to get some other, non-prelinger dataset
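[Editor's note: pulling the pieces above together, the whole no-download ingest is two passes over the census rows. This is a sketch, not the exact commands used: it assumes a tab-separated input (md5, size, collection, url) in a hypothetical census.tsv, and substitutes stock git-annex fromkey, which reads "key file" pairs from stdin, for the in-progress keyfile command mentioned at 18:18:

    # pass 1: create a working-tree entry per file, keyed MD5-s<size>--<md5>;
    # --force because the keys' contents aren't known to the repo yet
    perl -lne '($md5,$size,$coll,$url) = split("\t",$_,4);
               ($file = $url) =~ s{^https://archive\.org/download/}{};
               print "MD5-s$size--$md5 $coll/$file"' census.tsv |
        git annex fromkey --force

    # pass 2: record that each key is retrievable from its archive.org URL
    perl -lne '($md5,$size,$coll,$url) = split("\t",$_,4);
               print "MD5-s$size--$md5 $url"' census.tsv |
        git annex registerurl

Nothing is downloaded in either pass, which is why 30k files ingest in about a minute: the repo stores only keys, names, and URLs.]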
[18:46] and.. the .git directory for these 30k files is 38 mb
[18:46] less space than du says the files (broken symlinks) take up
[19:05] 279M usenethistorical/.git/objects
[19:05] annexed files in working tree: 61118
[19:05] size of annexed files in working tree: 695.28 gigabytes
[19:38] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[19:39] *** thunk has quit (Client Quit)
[20:11] nice!
[20:22] *** Now talking on #internetarchive.bak
[20:22] *** Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
[20:22] *** Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015
[20:23] *** svchfoo1 gives channel operator status to chfoo
[21:01] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[21:22] *** BEGIN LOGGING AT Sun Mar 15 16:22:47 2015
[21:51] *** espes___ has quit (Remote host closed the connection)
[22:11] *** johtso (uid563@[redacted]) has joined #internetarchive.bak
[22:12] *** garyrh_ has quit (Quit: Leaving)
[22:12] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[22:14] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[22:24] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[22:30] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[22:36] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[22:45] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[22:56] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
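[Editor's note: on the restore-timing question raised at 18:32 above, the consumer side of a repo built this way needs no special tooling; git-annex's built-in web remote fetches content from the URLs that registerurl recorded. A minimal sketch, with a hypothetical clone URL:

    git clone https://example.org/ia.bak.git
    cd ia.bak
    # fetch one file; git-annex downloads it from the recorded archive.org URL
    git annex get prelinger/yellowstone_national_park/yellowstone_national_park.mpeg
    # or time a whole-collection restore
    time git annex get usenethistorical/

Timing git annex get over a collection gives exactly the download-side number asked for above.]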