[00:07] this project is a-go?
[01:42] sep332: oh, I didn't see you were already using jq .. what command lines did you come up with?
[04:14] really? I spent like an hour getting this thing to output csv!
[04:15] jq -r '"\(.files[] | ["\"", .md5, "\"", ",", .size, ",", "\"", .name, "\"" | tostring] | add),\"\(.id)\",\(.collection)"'
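For comparison, the same CSV can be produced with a real CSV writer, which takes care of the quoting. This is a minimal sketch in Python, assuming one JSON object per input line, each with `id`, `collection`, and a `files` list whose entries carry `md5`, `size`, and `name` (field names taken from the jq filter above). The function names, the `;` join for list-valued collections, and the sample record are made up for illustration.

```python
import csv
import io
import json


def item_to_rows(item):
    """Yield one [md5, size, name, id, collection] row per file in a census item."""
    coll = item.get("collection")
    if isinstance(coll, list):
        # Assumption: a list-valued collection is flattened with ';'.
        coll = ";".join(coll)
    for f in item.get("files", []):
        yield [f.get("md5"), f.get("size"), f.get("name"), item.get("id"), coll]


def items_to_csv(lines, out):
    """Read one JSON object per line and write CSV rows to the file-like `out`."""
    writer = csv.writer(out)
    for line in lines:
        line = line.strip()
        if line:
            writer.writerows(item_to_rows(json.loads(line)))


# Hypothetical sample record, not real census data.
sample = json.dumps({
    "id": "example_item",
    "collection": "example_collection",
    "files": [
        {"md5": "d41d8cd98f00b204e9800998ecf8427e", "size": "0", "name": "a, b.txt"},
    ],
})
buf = io.StringIO()
items_to_csv([sample], buf)
print(buf.getvalue(), end="")
```

The writer quotes only the fields that need it (here the file name containing a comma), so the output differs cosmetically from the jq version, which quotes md5, name, and id unconditionally.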
[04:18] yeah, the syntax, it is insane.
[04:18] it's probably fine if you're doing JSON->JSON
[04:19] anyway i'm glad you got it working, it's probably faster on your box than my old AMD server anyway
[04:20] your dropbox file census_dupes.csv, seems smaller than I expected
[04:23] 32 million files
[04:23] i did get two different numbers somehow, but both right around 34 million
[04:24] i'm getting 177 million
[04:25] have not deduped by md5 yet, but it can't be that many
[04:27] but there are only 271 million files in the whole census. you're saying more than half of the archive is dupes?
[04:28] I think the 271 number was before they hid the dark and non-downloadable items
[04:28] 177 seems to be the count in the current census
[04:29] oh i don't even have that
[04:34] anyway i'm going to bed, it's past midnight here
[16:33] 157103356 md5_collection_url.txt.pick1.sorted.uniq
[16:33] 177739786 md5_collection_url.txt.pick1.sorted
[16:33] all right then
[16:37] a difference of 20 million - that means my 34 million is really 40 million? :)
[16:38] or should be
[17:00] so I grepped out the prelinger collection
[17:00] 30360 prelinger.collection.list
[17:06] 151129 GratefulDead.collection.list
[17:13] *** closure looks for a collection likely to have 70k files
[17:16] actually, a list of all collections and number of files would be nice
[17:16] guess I can do that
[17:20] cut -f 3 md5_collection_url.txt.pick1.sorted.uniq |sort | uniq -c | sort -rn
[17:27] 13157326 opensource_audio
[17:27] 14125353 playdrone-metadata
[17:27] 20723660 uspto
[17:27] 13124639 usfederalcourts
[17:27] 9348116 millionbooks
[17:27] 6891603 null
[17:27] wow
[17:27] such files
[17:28] 616356 gutenberg
[17:30] 61118 usenethistorical
[17:30] aha, that's a nice one :)
[17:38] (especially since I've wanted a git-annex repo of that for other reasons..)
[17:45] *** closure goes to hack a mass-ingest command into git-annex
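The per-collection tally from the `cut -f 3 | sort | uniq -c | sort -rn` pipeline above can also be done in one pass without sorting the whole file. This is a rough Python sketch, assuming tab-separated rows with the collection in the third column; the function name and the sample rows are invented for illustration.

```python
from collections import Counter


def count_by_collection(lines, column=2):
    """Count rows per value of a tab-separated column (0-based; 2 matches `cut -f 3`)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:  # skip rows that are too short to have the column
            counts[fields[column]] += 1
    return counts


# Hypothetical rows in md5 / size / collection / url order, not real census data.
rows = [
    "aaa\t10\tprelinger\thttps://archive.org/download/x/x.mpeg\n",
    "bbb\t20\tusenethistorical\thttps://archive.org/download/y/y.zip\n",
    "ccc\t30\tprelinger\thttps://archive.org/download/z/z.mpeg\n",
]
# Largest collections first, like `sort -rn`.
for name, n in count_by_collection(rows).most_common():
    print(f"{n:>9} {name}")
```

Memory use grows with the number of distinct collections rather than the number of rows, so passing an open file object instead of a list should also work for the full 157 million line file.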
[17:53] Morning.
[17:53] hey!
[17:55] Glad to see you're cracking on it.
[17:55] Obviously, understanding what exactly should even BE backed up is a big deal.
[17:55] And it seems we're pretty aggressively getting a grip on the "unknowable" archive.org.
[17:56] An additional possibility - not doing the open upload areas, for now.
[17:56] So, like, "opensource"
[17:56] so my plan for today is to make a git-annex repo containing all of prelinger and usenethistorical collections (100k files)
[17:59] as a test
[18:17] perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); $file=$url; $file=~s/https:\/\/archive.org\/download\///; $file="$coll/$file"; print ("MD5-s".$size."--".$md5." ".$file)'
[18:17] Great.
[18:18] | git-annex keyfile
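What the perl one-liner does to each row may be easier to follow spelled out. This is a sketch in Python of the same transformation, assuming the four tab-separated fields named in the perl (`md5`, `size`, `coll`, `url`); the function name is invented and the md5 and size in the sample row are placeholders, not real values.

```python
PREFIX = "https://archive.org/download/"


def keyfile_line(row):
    """Turn 'md5<TAB>size<TAB>collection<TAB>url' into 'KEY FILE'."""
    # Split into at most 4 fields, like split("\t", $_, 4) in the perl.
    md5, size, coll, url = row.rstrip("\n").split("\t", 3)
    # The perl removes the first occurrence of the download prefix anywhere in
    # the url; this sketch only strips it from the front.
    path = url[len(PREFIX):] if url.startswith(PREFIX) else url
    key = f"MD5-s{size}--{md5}"
    return f"{key} {coll}/{path}"


# Placeholder md5 and size; the url is the one that shows up later in the log.
row = ("0123456789abcdef0123456789abcdef\t1234\tprelinger\t"
       "https://archive.org/download/yellowstone_national_park/"
       "yellowstone_national_park.mpeg")
print(keyfile_line(row))
```

The file part ends up as `collection/item/filename`, which matches the `prelinger/yellowstone_national_park` directory seen in the whereis output further down.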
[18:18] SketchCow: while you're here, could you install some stuff on FOS? I'm going to want to build custom git-annex there.
[18:19] apt-get build-dep git-annex
[18:20] message me. I'm having issues adding them, want to make sure I'm doing it right, and not scrolling this channel.
[18:22] so yeah, that took around 1 minute for 30k files
[18:22] but I have to add the urls to them all still
[18:29] *** closure adds a git-annex registerurl for mass-url-adding
[18:31] OK, I am going to walk the SXSW floor. Hopefully I can redo the wiki pages.
[18:32] But closure, definitely keep going with it.
[18:32] So, swimming upstream through cranky academic letters I've been getting
[18:32] I'll send you those later, but one factoid would be nice to calculate (Obviously, we're GETTING this data in doing this test):
[18:32] Amount of time to download, amount to upload.
[18:33] So for a restore, how long that takes.
[18:36] \o/ git-annex registerurl implemented
[18:36] I love my haskell libs
[18:43] time perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); print ("MD5-s".$size."--".$md5." ".$url)'
real 2m41.658s
[18:43] registerurl stdin ok
[18:43] joey@darkstar:~/tmp/ia/repo1/prelinger/yellowstone_national_park>git annex whereis yellowstone_national_park.mpeg
[18:43] whereis yellowstone_national_park.mpeg (1 copy)
[18:43] 00000000-0000-0000-0000-000000000001 -- web
[18:43] web: https://archive.org/download/yellowstone_national_park/yellowstone_national_park.mpeg
[18:43] ok
[18:43] so yeah, there's a repo, and it knows how to download from the IA!
[18:44] *** closure goes to get some other, non-prelinger dataset
[18:46] and.. the .git directory for these 30k files is 38 mb
[18:46] less space than du says the files (broken symlinks) take up
[19:05] 279M usenethistorical/.git/objects
[19:05] annexed files in working tree: 61118
[19:05] size of annexed files in working tree: 695.28 gigabytes
[20:11] nice!
[20:22] *** Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
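On the download/upload time question raised earlier, a back-of-the-envelope number can be had from the annexed size alone. This is a rough Python sketch: the 695.28 gigabytes is the usenethistorical figure from the log, while the 100 Mbit/s link speed is purely an assumed example, and gigabytes are assumed to be decimal (10^9 bytes). Real transfers would also pay per-file overhead that this ignores.

```python
def transfer_hours(size_gigabytes, megabits_per_second):
    """Estimate hours to move `size_gigabytes` over a link of the given speed."""
    bits = size_gigabytes * 1e9 * 8               # decimal gigabytes -> bits
    seconds = bits / (megabits_per_second * 1e6)  # Mbit/s -> bit/s
    return seconds / 3600


# 695.28 GB is from the log; 100 Mbit/s is a hypothetical link speed.
print(f"{transfer_hours(695.28, 100):.1f} hours")
```

Running it once with the download speed and once with the upload speed gives the two halves of a restore estimate.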