[00:09] *** Owen-x has quit (Owen-x)
[00:26] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[00:44] *** tpw_rules (~tpw_rules@[redacted]) has joined #internetarchive.bak
[00:44] hey. i heard about you all from twitter. i've got lots of free space i can give
[00:47] how can i get my ssh public key added? closure the wiki says to talk to you
[00:54] closure: i've got a key to give you
[01:00] Closure is popping in and out. You might want to just message him and he'll see it when he unidles.
[01:00] A volunteer stepped forward with 19tb, very kind.
[01:00] just put it on a locked pastie or something? i have about 3TB to spare
[01:00] Of course, bigger numbers are more meaningful as we start considering how to do backups.
[01:00] tpw_rules: I don't understand locked pastie in this
[01:01] tpw_rules: Oh, I see. Well, however you'd like.
[01:01] it's a really ambitious project. have you talked to anybody in big business? google et al do this kind of thing
[01:08] tpw_rules: added your key
[01:08] cool. then just follow the wiki?
[01:08] it doesn't require any incoming ports open does it
[01:09] tpw_rules: yes, follow the wiki .. and keep in touch since this is just a test
[01:09] no incoming ports needed, no
[01:09] RSA key fingerprint is 79:ea:f9:7f:89:7e:29:27:4c:63:74:53:f9:1c:f3:d4. that you?
[01:10] i can do that
[01:10] i'll just idle here
[01:11] that's the right ssh host key, yes
[01:19] tpw_rules: Hehe, welcome around. :)
[01:25] i'm getting a 403 forbidden
[01:25] Try making some of these repositories available:
[01:25] 00000000-0000-0000-0000-000000000001 -- web
[01:25] is that me
[01:27] yeah i'm not able to download anything
[01:33] reset everything and still 403
[01:35] did i break something or is it an issue with archive.org?
[01:38] must be archive.org (works for me tho)
[01:38] http://pastie.org/private/5jtfzil4whdcji0lnecag
[01:39] "The item is not available due to issues with the item's content." so i guess i'm just downloading bad files
[01:39] i'll just let it go and get to some good files
[01:40] always possible they darked a few of the files but if it keeps failing, might be something else on your end
[01:40] oh, i changed the command a bit and it's downloading different files okay
[01:41] well i'll let that crunch overnight. full repo is ~3TB? i have that space
[01:42] https://archive.org/download/Ttscribe/Ttscribe_meta.xml is indeed darked
[01:42] a little less than 3tb I think
[01:43] there is a load-balancing issue which might slow your downloads, temporarily anyway
[01:45] why is a lot of this stuff tarred instead of compressed too?
[01:49] instead of tar.xz or something. ease of access?
[02:05] In general, the vast majority of archive items are compressed. not sure about these collections in particular though
[02:10] a quick glance shows these tar files are full of .jp2 (JPEG2000) files.
[02:11] ahhh. so no sense recompressing them
[02:25] ohai tpw_rules
[02:25] it might be worth it to create a guide showing how to attach a bunch of spare disks to a raspberry pi or something and set it up to archive
[02:25] i'm doing research on using unionfs to tolerate disk failures and i'll see what i can come up with
[02:27] we all know everybody loves relatively meaningless raspberry pi projects :D
[02:35] So, two things that are becoming obvious
[02:35] One, the "backup drive" will have notable curation by a team of us, where we slowly add new items to the "drive", based on historical value and need
[02:35] Because multiple petabytes are unlikely to fall out of the sky
[02:37] are there any sort of "responsible bandwidth limits" for doing something like this by archive.org itself? i can suck 12MB/s down and i don't want to break anything
[02:38] No, absolutely not
[02:38] Brewster wants the lines absolutely packed to insane levels all the time.
[02:38] And then he'll buy more.
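The point above about the tarred .jp2 files ("no sense recompressing them") can be sanity-checked: JPEG2000 payloads are already entropy-coded, so a general-purpose compressor gains essentially nothing on them. A minimal Python sketch, using random bytes as a stand-in for already-compressed data (an assumption for illustration, not actual IA files):

```python
import os
import zlib

# Random bytes model already-compressed data (like .jp2 payloads):
# near-maximal entropy, so zlib cannot shrink them and may even
# add a little framing overhead.
incompressible = os.urandom(1_000_000)

# Highly repetitive text compresses dramatically, by contrast.
text_like = b"archive " * 125_000  # ~1 MB of repeated content

print(len(zlib.compress(incompressible)))  # roughly the input size, or slightly more
print(len(zlib.compress(text_like)))       # a few kilobytes
```

This is why wrapping .jp2 files in a plain tar (for grouping and ease of access) rather than tar.xz costs almost nothing in storage.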
[02:38] We went from 40GB/s to 80GB/s relatively recently
[02:38] if you say so :)
[02:39] I'm trying to see if I can find a metric.
[02:39] If https://monitor.archive.org/weathermap/weathermap.html is still public
[02:39] i have no idea what that means but it looks cool
[02:40] *** balrog wishes btrfs had per-subvolume RAID already
[02:40] Well, we want yellow. Lots of yellow
[02:41] And then turning it back to blue
[02:46] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[03:22] *** Owen-x has quit (Owen-x)
[03:25] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[03:40] *** svchfoo1 has quit (Read error: Operation timed out)
[03:42] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[03:43] *** svchfoo2 gives channel operator status to svchfoo1
[03:50] *** Owen-x has quit (Owen-x)
[03:55] how reliable is an rpi in terms of ram corruption etc
[04:00] I heard that
[04:29] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[04:34] *** bzc6p has quit (Ping timeout: 600 seconds)
[04:34] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[05:06] *** zottelbey has quit (Remote host closed the connection)
[08:09] SketchCow: speed is getting better, peaks at 1.5MB/s now
[09:03] *** Muad-Dib (~paul@[redacted]) has joined #internetarchive.bak
[10:39] *** svchfoo1 has quit (Remote host closed the connection)
[10:40] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[10:43] *** svchfoo2 gives channel operator status to svchfoo1
[10:57] *** csssuf has quit (Ping timeout: 370 seconds)
[10:58] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[12:21] Great
[13:03] *** csssuf has quit (Ping timeout: 370 seconds)
[13:04] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[13:44] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[13:59] *** bpye (~quassel@[redacted]) has joined #internetarchive.bak
[13:59] *** bpye has quit (Remote host closed the connection)
[14:00] *** bpye has quit (Remote host closed the connection)
[14:00] *** bpye (~quassel@[redacted]) has joined #internetarchive.bak
[14:32] and it slowed down again :p 92.0KB/s
[16:30] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[16:35] *** bzc6p_ has quit (Read error: Operation timed out)
[17:14] *** patricko- is now known as patrickod
[17:26] *** patrickod is now known as patricko-
[17:50] How much of it is out there! (Closure had a factoid)
[17:50] The stats run - how long does it take?
[18:04] *** zottelbey has quit (Remote host closed the connection)
[18:06] *** patricko- is now known as patrickod
[18:08] hey so I'd like to set up a project on github for client-side scripts
[18:09] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[18:11] *** bzc6p__ has quit (Read error: Operation timed out)
[18:15] numcopies +0: 72540
[18:15] numcopies +1: 30779
[18:15] numcopies +2: 21
[18:15] numcopies +3: 3
[18:15] these stats are probably out of date.. everyone: git-annex sync
[18:16] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[18:18] *** bzc6p_ has quit (Read error: Operation timed out)
[18:28] closure: do you have a few minutes to look at an error I'm getting when I build git-annex?
[18:29] *** patrickod is now known as patricko-
[18:32] *** patricko- is now known as patrickod
[18:37] *** patrickod is now known as patricko-
[18:49] Hey, so who was it who reported the 1pb of duplicates by md5?
[18:53] *** bzc6p__ is now known as bzc6p
[18:54] sep332: Yo
[18:54] oh hey
[18:55] yeah that was me
[18:56] Is it an actual report or textfile?
[18:57] "by md5" --- ehh...
[18:57] suggestion, also compare filesizes
[18:57] well i have a list of: count, hash, size
[18:57] so i did (count-1) * size to get size of duplicates
[18:58] So, short form.
[18:58] it is of interest to Brewster and IA if there is an assessment showing that there is 1pb of duplicate files.
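The arithmetic sep332 describes above ("i have a list of: count, hash, size" and "(count-1) * size to get size of duplicates") is easy to sketch: for each hash seen `count` times, one copy is the original and the rest are redundant. A minimal illustration in Python with made-up rows (the hashes and sizes are hypothetical, not actual IA data):

```python
# Hypothetical (count, md5, size-in-bytes) rows, in the shape sep332 describes.
rows = [
    (3, "d41d8cd9...", 1_000_000),  # 3 copies -> 2 redundant copies
    (2, "9e107d9d...", 500_000),    # 2 copies -> 1 redundant copy
    (1, "e4d909c2...", 750_000),    # unique file -> no waste
]

# For each hash, (count - 1) copies are redundant; sum their sizes.
duplicate_bytes = sum((count - 1) * size for count, _, size in rows)
print(duplicate_bytes)  # 2500000
```

Summing that quantity over the full file list is what produces the "1pb of duplicates" figure being discussed.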
[18:59] And if it comes in the form of something we can look at.
[18:59] So, if you have a file that can be "here are items that are the same"
[18:59] That will be of use specifically.
[19:00] I'd say a CSV of:
[19:00] size,item1,item2,item....
[19:00] *** patricko- is now known as patrickod
[19:09] balrog: MD5 is still resistant to preimage attacks. checking the size is a good idea though
[19:10] IMHO, I'd do more than md5+size to actually confirm duplicates
[19:10] I'd probably do compare of the entire data, if it was my drive
[19:10] SketchCow: I mostly have lists of individual files, but I can try extracting items if that's more useful
[19:11] I think items are how we deal.
[19:21] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[19:23] *** patrickod is now known as patricko-
[19:43] *** Owen-x has quit (Owen-x)
[19:47] *** csssuf has quit (Ping timeout: 370 seconds)
[19:47] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[21:03] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[21:15] *** patricko- is now known as patrickod
[21:19] *** Owen-x_ (~owen@[redacted]) has joined #internetarchive.bak
[21:21] *** patrickod is now known as patricko-
[21:21] *** Owen-x has quit (Ping timeout: 186 seconds)
[21:21] *** Owen-x_ is now known as Owen-x
[21:22] *** Owen-x has quit (Client Quit)
[22:26] *** swebb has quit (Quit: badcheese.com - where crap sometimes gets done)
[22:31] *** swebb (~swebb@[redacted]) has joined #internetarchive.bak
[22:32] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[22:35] *** patricko- is now known as patrickod
[23:02] *** patrickod is now known as patricko-
[23:02] *** patricko- is now known as patrickod
[23:08] *** Owen-x has quit (Owen-x)
[23:18] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[23:26] *** patrickod is now known as patricko-
[23:32] *** Owen-x has quit (Owen-x)
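The two suggestions in the log, SketchCow's "CSV of: size,item1,item2,..." and the advice to "do compare of the entire data" rather than trust md5+size alone, fit together naturally. Here is a hedged Python sketch of that workflow; the use of sha256 as the full-content check and the function names are my assumptions for illustration, not anything the channel specified:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash the entire file contents (sha256 chosen as an assumption;
    any full-content comparison satisfies the 'compare the entire data' advice)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_csv_lines(paths):
    """Group candidates by (size, full-content hash) and emit
    'size,item1,item2,...' lines for every group with more than one member."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        groups[(p.stat().st_size, content_hash(p))].append(str(p))
    return [
        ",".join([str(size)] + sorted(items))
        for (size, _), items in groups.items()
        if len(items) > 1
    ]
```

In a real run over petabytes you would cheaply pre-filter by size and md5 first and only full-hash the collisions; the sketch skips that optimization for clarity, and it lists file paths where SketchCow ultimately wants archive.org item identifiers.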