[00:07] this project is a-go?
[01:42] sep332: oh, I didn't see you were already using jq .. what command lines did you come up with?
[01:49] *** zottelbey has quit (Remote host closed the connection)
[02:14] *** londoncal has quit (Leaving...)
[02:43] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[03:16] *** DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
[03:16] *** svchfoo1 gives channel operator status to DFJustin
[03:31] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[04:14] really? I spent like an hour getting this thing to output csv!
[04:15] jq -r '"\(.files[] | ["\"", .md5, "\"", ",", .size, ",", "\"", .name, "\"" | tostring] | add),\"\(.id)\",\(.collection)"'
[04:18] yeah, the syntax, it is insane.
[04:18] it's probably fine if you're doing JSON->JSON
[04:19] anyway i'm glad you got it working, it's probably faster on your box than my old AMD server anyway
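[Editor's note: the jq one-liner above builds the CSV quoting by hand, which is most of its pain; jq's built-in @csv filter does the escaping itself. A minimal equivalent sketch, assuming each census record has the shape the command above implies ({"id": ..., "collection": ..., "files": [{"md5": ..., "size": ..., "name": ...}]}) and lives in a hypothetical census.json:

    jq -r '. as $item                # keep the item-level fields in scope
           | .files[]               # one CSV row per file
           | [.md5, .size, .name, $item.id, $item.collection]
           | @csv' census.json

@csv quotes strings and leaves numbers bare, so the output comes out close to the hand-rolled "md5",size,"name","id",collection layout.]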
[04:20] your dropbox file census_dupes.csv seems smaller than I expected
[04:23] 32 million files
[04:23] i did get two different numbers somehow, but both right around 34 million
[04:24] i'm getting 177 million
[04:25] have not deduped by md5 yet, but it can't be that many
[04:27] but there are only 271 million files in the whole census. you're saying more than half of the archive is dupes?
[04:28] I think the 271 number was before they hid the dark and non-downloadable items
[04:28] 177 seems to be the count in the current census
[04:29] oh i don't even have that
[04:31] *** niyaje has quit (Read error: Operation timed out)
[04:34] anyway i'm going to bed, it's past midnight here
[06:27] *** niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
[07:31] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[07:32] *** thunk has quit (Client Quit)
[07:51] *** X-Scale has quit (Remote host closed the connection)
[08:21] *** niyaje has quit (Ping timeout: 600 seconds)
[09:11] *** GLaDOS has quit (Read error: Operation timed out)
[10:20] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[10:20] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[10:24] *** bzc6p_ has quit (Ping timeout: 606 seconds)
[11:00] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[11:02] *** bzc6p__ has quit (Ping timeout: 240 seconds)
[11:12] *** GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
[11:22] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[11:32] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[11:33] *** thunk has quit (Client Quit)
[12:29] *** garyrh (garyrh@[redacted]) has joined #internetarchive.bak
[12:29] *** londoncal has quit (Quit: Leaving...)
[12:29] *** svchfoo1 gives channel operator status to garyrh
[13:16] *** johtso has quit (Quit: Connection closed for inactivity)
[13:40] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[14:03] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[14:04] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[15:39] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[16:33] 157103356 md5_collection_url.txt.pick1.sorted.uniq
[16:33] 177739786 md5_collection_url.txt.pick1.sorted
[16:33] all right then
[16:37] a difference of 20 million - that means my 34 million is really 40 million? :)
[16:38] or should be
[17:00] so I grepped out the prelinger collection
[17:00] 30360 prelinger.collection.list
[17:06] 151129 GratefulDead.collection.list
[17:13] *** closure looks for a collection likely to have 70k files
[17:16] actually, a list of all collections and number of files would be nice
[17:16] guess I can do that
[17:20] cut -f 3 md5_collection_url.txt.pick1.sorted.uniq | sort | uniq -c | sort -rn
[17:27] 13157326 opensource_audio
[17:27] 14125353 playdrone-metadata
[17:27] 20723660 uspto
[17:27] 13124639 usfederalcourts
[17:27] 9348116 millionbooks
[17:27] 6891603 null
[17:27] wow
[17:27] such files
[17:28] 616356 gutenberg
[17:30] 61118 usenethistorical
[17:30] aha, that's a nice one :)
[17:38] (especially since I've wanted a git-annex repo of that for other reasons..)
[17:45] *** closure goes to hack a mass-ingest command into git-annex
[17:53] Morning.
[17:53] hey!
[17:55] Glad to see you're racking on it.
[17:55] Obviously, understanding what exactly should even BE backed up is a big deal.
[17:55] And it seems we're pretty aggressively getting a grip on the "unknowable" archive.org.
[17:56] An additional possibility - not doing the open upload areas, for now.
[17:56] So, like, "opensource"
[17:56] so my plan for today is to make a git-annex repo containing all of the prelinger and usenethistorical collections (100k files)
[17:59] as a test
[18:17] perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); $file=$url; $file=~s/https:\/\/archive.org\/download\///; $file="$coll/$file"; print ("MD5-s".$size."--".$md5." ".$file)'
[18:17] Great.
[18:18] | git-annex keyfile
[18:18] SketchCow: while you're here, could you install some stuff on FOS? I'm going to want to build custom git-annex there.
[18:19] apt-get build-dep git-annex
[18:20] message me. I'm having issues adding them, want to make sure I'm doing it right, and not scrolling this channel.
[18:22] so yeah, that took around 1 minute for 30k files
[18:22] but I have to add the urls to them all still
[18:29] *** closure adds a git-annex registerurl for mass-url-adding
[18:31] OK, I am going to walk the SXSW floor. Hopefully I can redo the wiki pages.
[18:32] But closure, definitely keep going with it.
[18:32] So, swimming upstream through cranky academic letters I've been getting
[18:32] I'll send you those later, but one factoid would be nice to calculate (Obviously, we're GETTING this data in doing this test):
[18:32] Amount of time to download, amount to upload.
[18:33] So for a restore, how long that takes.
[18:36] \o/ git-annex registerurl implemented
[18:36] I love my haskell libs
[18:43] time perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); print ("MD5-s".$size."--".$md5." ".$url)' real 2m41.658s
[18:43] registerurl stdin ok
[18:43] joey@darkstar:~/tmp/ia/repo1/prelinger/yellowstone_national_park>git annex whereis yellowstone_national_park.mpeg
[18:43] 00000000-0000-0000-0000-000000000001 -- web
[18:43] web: https://archive.org/download/yellowstone_national_park/yellowstone_national_park.mpeg
[18:43] ok
[18:43] whereis yellowstone_national_park.mpeg (1 copy)
[18:43] so yeah, there's a repo, and it knows how to download from the IA!
[18:44] *** closure goes to get some other, non-prelinger dataset
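[Editor's note: pulling the pieces above together, the whole no-download ingest is two passes over the census rows. This is a sketch, not the exact commands used: it assumes a tab-separated input (md5, size, collection, url) in a hypothetical census.tsv, and substitutes stock git-annex fromkey, which reads "key file" pairs from stdin, for the in-progress keyfile command mentioned at 18:18:

    # pass 1: create a working-tree entry per file, keyed MD5-s<size>--<md5>;
    # --force because the keys' contents aren't known to the repo yet
    perl -lne '($md5,$size,$coll,$url) = split("\t",$_,4);
               ($file = $url) =~ s{^https://archive\.org/download/}{};
               print "MD5-s$size--$md5 $coll/$file"' census.tsv |
        git annex fromkey --force

    # pass 2: record that each key is retrievable from its archive.org URL
    perl -lne '($md5,$size,$coll,$url) = split("\t",$_,4);
               print "MD5-s$size--$md5 $url"' census.tsv |
        git annex registerurl

Nothing is downloaded in either pass, which is why 30k files ingest in about a minute: the repo stores only keys, names, and URLs.]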
[18:46] and.. the .git directory for these 30k files is 38 mb
[18:46] less space than du says the files (broken symlinks) take up
[19:05] 279M usenethistorical/.git/objects
[19:05] annexed files in working tree: 61118
[19:05] size of annexed files in working tree: 695.28 gigabytes
[19:38] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[19:39] *** thunk has quit (Client Quit)
[20:11] nice!
[20:22] *** Now talking on #internetarchive.bak
[20:22] *** Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
[20:22] *** Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015
[20:23] *** svchfoo1 gives channel operator status to chfoo
[21:01] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[21:22] *** BEGIN LOGGING AT Sun Mar 15 16:22:47 2015
[21:51] *** espes___ has quit (Remote host closed the connection)
[22:11] *** johtso (uid563@[redacted]) has joined #internetarchive.bak
[22:12] *** garyrh_ has quit (Quit: Leaving)
[22:12] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[22:14] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[22:24] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[22:30] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[22:36] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[22:45] *** thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
[22:56] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
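[Editor's note: on the restore-timing question raised at 18:32 above, the consumer side of a repo built this way needs no special tooling; git-annex's built-in web remote fetches content from the URLs that registerurl recorded. A minimal sketch, with a hypothetical clone URL:

    git clone https://example.org/ia.bak.git
    cd ia.bak
    # fetch one file; git-annex downloads it from the recorded archive.org URL
    git annex get prelinger/yellowstone_national_park/yellowstone_national_park.mpeg
    # or time a whole-collection restore
    time git annex get usenethistorical/

Timing git annex get over a collection gives exactly the download-side number asked for above.]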