#internetarchive.bak 2015-03-15,Sun


Time Nickname Message
00:07 🔗 londoncal this project is a-go?
01:42 🔗 closure sep332: oh, I didn't see you were already using jq .. what command lines did you come up with?
01:49 🔗 zottelbey has quit (Remote host closed the connection)
02:14 🔗 londoncal has quit (Leaving...)
02:43 🔗 niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
03:16 🔗 DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
03:16 🔗 svchfoo1 gives channel operator status to DFJustin
03:31 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
04:14 🔗 sep332 really? I spent like an hour getting this thing to output csv!
04:15 🔗 sep332 jq -r '"\(.files[] | ["\"", .md5, "\"", ",", .size, ",", "\"", .name, "\"" | tostring] | add),\"\(.id)\",\(.collection)"'
04:18 🔗 closure yeah, the syntax, it is insane.
04:18 🔗 sep332 it's probably fine if you're doing JSON->JSON
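[editor's sketch] The jq one-liner above flattens each census item into one CSV row per file. The same idea can be sketched in Python, where the csv module takes care of the quoting that the jq command builds by hand. This is only a rough sketch under assumptions: the field names (`id`, `collection`, `files`, `md5`, `size`, `name`) are taken from the jq command, the one-JSON-object-per-line input layout is assumed, and the sample record is invented.

```python
import csv
import io
import json


def item_to_csv_rows(item):
    """Yield one [md5, size, name, id, collection] row per file in an item."""
    for f in item.get("files", []):
        yield [f.get("md5"), f.get("size"), f.get("name"),
               item.get("id"), item.get("collection")]


def census_to_csv(lines):
    """Convert an iterable of JSON lines (one item per line, assumed) to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for row in item_to_csv_rows(json.loads(line)):
            writer.writerow(row)
    return out.getvalue()


# Invented sample record; real census records may look different.
sample = [
    '{"id": "example_item", "collection": "example_coll", '
    '"files": [{"md5": "d41d8cd98f00b204e9800998ecf8427e", '
    '"size": 0, "name": "a,b.txt"}]}'
]
print(census_to_csv(sample), end="")
```

Unlike the jq version, which always wraps md5, name, and id in quotes, csv.writer quotes a field only when it contains a comma, quote, or newline.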
04:19 🔗 sep332 anyway i'm glad you got it working, it's probably faster on your box than my old AMD server anyway
04:20 🔗 closure your dropbox file census_dupes.csv, seems smaller than I expected
04:23 🔗 closure 32 million files
04:23 🔗 sep332 i did get two different numbers somehow, but both right around 34 million
04:24 🔗 closure i'm getting 177 million
04:25 🔗 closure have not deduped by md5 yet, but it can't be that many
04:27 🔗 sep332 but there are only 271 million files in the whole census. you're saying more than half of the archive is dupes?
04:28 🔗 closure I think the 271 number was before they hid the dark and non-downloadable items
04:28 🔗 closure 177 seems to be the count in the current census
04:29 🔗 sep332 oh i don't even have that
04:31 🔗 niyaje has quit (Read error: Operation timed out)
04:34 🔗 sep332 anyway i'm going to bed, it's past midnight here
06:27 🔗 niyaje (~niyaje@[redacted]) has joined #internetarchive.bak
07:31 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
07:32 🔗 thunk has quit (Client Quit)
07:51 🔗 X-Scale has quit (Remote host closed the connection)
08:21 🔗 niyaje has quit (Ping timeout: 600 seconds)
09:11 🔗 GLaDOS has quit (Read error: Operation timed out)
10:20 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
10:20 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
10:24 🔗 bzc6p_ has quit (Ping timeout: 606 seconds)
11:00 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
11:02 🔗 bzc6p__ has quit (Ping timeout: 240 seconds)
11:12 🔗 GLaDOS (~STR_IDENT@[redacted]) has joined #internetarchive.bak
11:22 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
11:32 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
11:33 🔗 thunk has quit (Client Quit)
12:29 🔗 garyrh (garyrh@[redacted]) has joined #internetarchive.bak
12:29 🔗 londoncal has quit (Quit: Leaving...)
12:29 🔗 svchfoo1 gives channel operator status to garyrh
13:16 🔗 johtso has quit (Quit: Connection closed for inactivity)
13:40 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
14:03 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
14:04 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
15:39 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
16:33 🔗 closure 157103356 md5_collection_url.txt.pick1.sorted.uniq
16:33 🔗 closure 177739786 md5_collection_url.txt.pick1.sorted
16:33 🔗 closure all right then
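[editor's sketch] The two counts above are line counts of the sorted list before and after removing duplicate lines. A minimal Python sketch of that comparison follows. The four-column sample lines are invented, and an in-memory set is an assumption that suits small inputs only; for a list of this size, an external sort followed by uniq is the more realistic route.

```python
def count_total_and_unique(lines):
    """Return (total line count, number of distinct lines)."""
    total = 0
    seen = set()
    for line in lines:
        total += 1
        seen.add(line.rstrip("\n"))
    return total, len(seen)


# Invented sample lines in an assumed md5 / size / collection / url layout.
sample = [
    "aaa\t10\tcoll1\turl1\n",
    "aaa\t10\tcoll1\turl1\n",
    "bbb\t20\tcoll2\turl2\n",
]
total, unique = count_total_and_unique(sample)
print(total, unique, total - unique)

# The same subtraction applied to the two counts quoted in the log.
print(177739786 - 157103356)
```

For the counts in the log the difference is 20636430, which is the "20 million" mentioned in the next message.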
16:37 🔗 sep332 a difference of 20 million - that means my 34 million is really 40 million? :)
16:38 🔗 sep332 or should be
17:00 🔗 closure so I grepped out the prelinger collection
17:00 🔗 closure 30360 prelinger.collection.list
17:06 🔗 closure 151129 GratefulDead.collection.list
17:13 🔗 closure looks for a collection likely to have 70k files
17:16 🔗 closure actually, a list of all collections and number of files would be nice
17:16 🔗 closure guess I can do that
17:20 🔗 closure cut -f 3 md5_collection_url.txt.pick1.sorted.uniq |sort | uniq -c | sort -rn
17:27 🔗 closure 13157326 opensource_audio
17:27 🔗 closure 14125353 playdrone-metadata
17:27 🔗 closure 20723660 uspto
17:27 🔗 closure 13124639 usfederalcourts
17:27 🔗 closure 9348116 millionbooks
17:27 🔗 closure 6891603 null
17:27 🔗 closure wow
17:27 🔗 closure such files
17:28 🔗 closure 616356 gutenberg
17:30 🔗 closure 61118 usenethistorical
17:30 🔗 closure aha, that's a nice one :)
17:38 🔗 closure (especially since I've wanted a git-annex repo of that for other reasons..)
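[editor's sketch] The `cut | sort | uniq -c | sort -rn` pipeline above counts files per collection and lists the largest collections first. A rough Python equivalent uses collections.Counter. It assumes tab-separated lines with the collection in the third column, matching `cut -f 3`; the sample lines and collection names are invented.

```python
from collections import Counter


def files_per_collection(lines, column=2):
    """Count lines per value of the given zero-based, tab-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:  # ignore lines that are too short
            counts[fields[column]] += 1
    # most_common() sorts by descending count, like `sort -rn`.
    return counts.most_common()


# Invented sample lines in an assumed md5 / size / collection / url layout.
sample = [
    "aaa\t10\tcoll_a\turl1\n",
    "bbb\t20\tcoll_b\turl2\n",
    "ccc\t30\tcoll_a\turl3\n",
]
for name, count in files_per_collection(sample):
    print(count, name)
```

One difference from the shell pipeline: Counter holds one entry per distinct collection in memory, so no sorted intermediate file is needed.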
17:45 🔗 closure goes to hack a mass-ingest command into git-annex
17:53 🔗 SketchCow Morning.
17:53 🔗 closure hey!
17:55 🔗 SketchCow Glad to see you're cracking on it.
17:55 🔗 SketchCow Obviously, understanding what exactly should even BE backed up is a big deal.
17:55 🔗 SketchCow And it seems we're pretty aggressively getting a grip on the "unknowable" archive.org.
17:56 🔗 SketchCow An additional possibility - not doing the open upload areas, for now.
17:56 🔗 SketchCow So, like, "opensource"
17:56 🔗 closure so my plan for today is to make a git-annex repo containing all of prelinger and usenethistorical collections (100k files)
17:59 🔗 closure as a test
18:17 🔗 closure perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); $file=$url; $file=~s/https:\/\/archive.org\/download\///; $file="$coll/$file"; print ("MD5-s".$size."--".$md5." ".$file)'
18:17 🔗 SketchCow Great.
18:18 🔗 closure | git-annex keyfile
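[editor's sketch] The perl one-liner above turns each tab-separated list line into a key of the form `MD5-s<size>--<md5>`, followed by a space and a file path made of the collection name plus the URL with its download prefix removed. A minimal Python sketch of the same transformation follows. The md5 and size in the sample are invented, and the function name is hypothetical. One deliberate difference: this version strips the prefix only when the URL starts with it, while the perl substitution removes the first match anywhere in the string.

```python
PREFIX = "https://archive.org/download/"


def key_and_path(line):
    """Build '<key> <collection>/<path>' from an md5/size/collection/url line."""
    md5, size, coll, url = line.rstrip("\n").split("\t", 3)
    # Drop the download prefix so the path starts at the item name.
    path = url[len(PREFIX):] if url.startswith(PREFIX) else url
    return f"MD5-s{size}--{md5} {coll}/{path}"


# Invented md5 and size; the URL is the one shown later in the log.
sample = (
    "0123456789abcdef0123456789abcdef\t12345\tprelinger\t"
    "https://archive.org/download/yellowstone_national_park/"
    "yellowstone_national_park.mpeg\n"
)
print(key_and_path(sample))
```

The split limit of 3 mirrors the perl `split("\t", $_, 4)`: at most four fields, so a tab inside the URL would stay in the last field.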
18:18 🔗 closure SketchCow: while you're here, could you install some stuff on FOS? I'm going to want to build custom git-annex there.
18:19 🔗 closure apt-get build-dep git-annex
18:20 🔗 SketchCow message me. I'm having issues adding them, want to make sure I'm doing it right, and not scrolling this channel.
18:22 🔗 closure so yeah, that took around 1 minute for 30k files
18:22 🔗 closure but I have to add the urls to them all still
18:29 🔗 closure adds a git-annex registerurl for mass-url-adding
18:31 🔗 SketchCow OK, I am going to walk the SXSW floor. Hopefully I can redo the wiki pages.
18:32 🔗 SketchCow But closure, definitely keep going with it.
18:32 🔗 SketchCow So, swimming upstream through cranky academic letters I've been getting
18:32 🔗 SketchCow I'll send you those later, but one factoid would be nice to calculate (Obviously, we're GETTING this data in doing this test):
18:32 🔗 SketchCow Amount of time to download, amount to upload.
18:33 🔗 SketchCow So for a restore, how long that takes.
18:36 🔗 closure \o/ git-annex registerurl implemented
18:36 🔗 closure I love my haskell libs
18:43 🔗 closure time perl -lne 'my ($md5, $size, $coll, $url)=split("\t", $_,4); print ("MD5-s".$size."--".$md5." ".$url)' <p.list | git -c annex.alwayscommit=false annex registerurl
18:43 🔗 closure real 2m41.658s
18:43 🔗 closure registerurl stdin ok
18:43 🔗 closure joey@darkstar:~/tmp/ia/repo1/prelinger/yellowstone_national_park>git annex whereis yellowstone_national_park.mpeg
18:43 🔗 closure 00000000-0000-0000-0000-000000000001 -- web
18:43 🔗 closure web: https://archive.org/download/yellowstone_national_park/yellowstone_national_park.mpeg
18:43 🔗 closure ok
18:43 🔗 closure whereis yellowstone_national_park.mpeg (1 copy)
18:43 🔗 closure so yeah, there's a repo, and it knows how to download from the IA!
18:44 🔗 closure goes to get some other, non-prelinger dataset
18:46 🔗 closure and.. the .git directory for these 30k files is 38 mb
18:46 🔗 closure less space than du says the files (broken symlinks) take up
19:05 🔗 closure 279M usenethistorical/.git/objects
19:05 🔗 closure annexed files in working tree: 61118
19:05 🔗 closure size of annexed files in working tree: 695.28 gigabytes
19:38 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
19:39 🔗 thunk has quit (Client Quit)
20:11 🔗 xmc nice!
20:22 🔗 Now talking on #internetarchive.bak
20:22 🔗 Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
20:22 🔗 Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015
20:23 🔗 svchfoo1 gives channel operator status to chfoo
21:01 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
21:22 🔗 BEGIN LOGGING AT Sun Mar 15 16:22:47 2015
21:51 🔗 espes___ has quit (Remote host closed the connection)
22:11 🔗 johtso (uid563@[redacted]) has joined #internetarchive.bak
22:12 🔗 garyrh_ has quit (Quit: Leaving)
22:12 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
22:14 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
22:24 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
22:30 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
22:36 🔗 bzc6p_ has quit (Ping timeout: 600 seconds)
22:45 🔗 thunk has quit (http://www.kiwiirc.com/ - A hand crafted IRC client)
22:56 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
