[00:09] *** Owen-x has quit (Owen-x)
[00:26] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[00:44] *** tpw_rules (~tpw_rules@[redacted]) has joined #internetarchive.bak
[00:44] hey. i heard about you all from twitter. i've got lots of free space i can give
[00:47] how can i get my ssh public key added? closure the wiki says to talk to you
[00:54] closure: i've got a key to give you
[01:00] Closure is popping in and out. You might want to just message him and he'll see it when he unidles.
[01:00] A volunteer stepped forward with 19tb, very kind.
[01:00] just put it on a locked pastie or something? i have about 3TB to spare
[01:00] Of course, bigger numbers are more meaningful as we start considering how to do backups.
[01:00] tpw_rules: I don't understand locked pastie in this
[01:01] tpw_rules: Oh, I see. Well, however you'd like.
[01:01] it's a really ambitious project. have you talked to anybody in big business? google et al do this kind of thing
[01:08] tpw_rules: added your key
[01:08] cool. then just follow the wiki?
[01:08] it doesn't require any incoming ports open does it
[01:09] tpw_rules: yes, follow the wiki .. and keep in touch since this is just a test
[01:09] no incoming ports needed, no
[01:09] RSA key fingerprint is 79:ea:f9:7f:89:7e:29:27:4c:63:74:53:f9:1c:f3:d4. that you?
[01:10] i can do that
[01:10] i'll just idle here
[01:11] that's the right ssh host key, yes
[01:19] tpw_rules: Hehe, welcome around. :)
[01:25] i'm getting a 403 forbidden
[01:25] Try making some of these repositories available:
[01:25] 00000000-0000-0000-0000-000000000001 -- web
[01:25] is that me
[01:27] yeah i'm not able to download anything
[01:33] reset everything and still 403
[01:35] did i break something or is it an issue with archive.org?
[01:38] must be archive.org (works for me tho)
[01:38] http://pastie.org/private/5jtfzil4whdcji0lnecag
[01:39] "The item is not available due to issues with the item's content." so i guess i'm just downloading bad files
[01:39] i'll just let it go and get to some good files
[01:40] always possible they darked a few of the files but if it keeps failing, might be something else on your end
[01:40] oh, i changed the command a bit and it's downloading different files okay
[01:41] well i'll let that crunch overnight. full repo is ~3TB? i have that space
[01:42] https://archive.org/download/Ttscribe/Ttscribe_meta.xml is indeed darked
[01:42] a little less than 3tb I think
[01:43] there is a load-balancing issue which might slow your downloads, temporarily anyway
[01:45] why is a lot of this stuff tarred instead of compressed too?
[01:49] instead of tar.xz or something. ease of access?
[02:05] In general, the vast majority of archive items are compressed. not sure about these collections in particular though
[02:10] a quick glance shows these tar files are full of .jp2 (JPEG2000) files.
[02:11] ahhh. so no sense recompressing them
[02:25] ohai tpw_rules
[02:25] it might be worth it to create a guide showing how to attach a bunch of spare disks to a raspberry pi or something and set it up to archive
[02:25] i'm doing research on using unionfs to tolerate disk failures and i'll see what i can come up with
[02:27] we all know everybody loves relatively meaningless raspberry pi projects :D
[02:35] So, two things that are becoming obvious
[02:35] One, the "backup drive" will have notable curation by a team of us, where we slowly add new items to the "drive", based on historical value and need
[02:35] Because multiple petabytes are unlikely to fall out of the sky
[02:37] are there any sort of "responsible bandwidth limits" for doing something like this by archive.org itself? i can suck 12MB/s down and i don't want to break anything
[02:38] No, absolutely not
[02:38] Brewster wants the lines absolutely packed to insane levels all the time.
[02:38] And then he'll buy more.
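The point above about the tarred .jp2 files ("no sense recompressing them") can be sanity-checked: JPEG2000 payloads are already entropy-coded, so a general-purpose compressor gains essentially nothing on them. A minimal Python sketch, using random bytes as a stand-in for already-compressed data (an assumption for illustration, not actual IA files):

```python
import os
import zlib

# Random bytes model already-compressed data (like .jp2 payloads):
# near-maximal entropy, so zlib cannot shrink them and may even
# add a little framing overhead.
incompressible = os.urandom(1_000_000)

# Highly repetitive text compresses dramatically, by contrast.
text_like = b"archive " * 125_000  # ~1 MB of repeated content

print(len(zlib.compress(incompressible)))  # roughly the input size, or slightly more
print(len(zlib.compress(text_like)))       # a few kilobytes
```

This is why wrapping .jp2 files in a plain tar (for grouping and ease of access) rather than tar.xz costs almost nothing in storage.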
[02:38] We went from 40GB/s to 80GB/s relatively recently
[02:38] if you say so :)
[02:39] I'm trying to see if I can find a metric.
[02:39] If https://monitor.archive.org/weathermap/weathermap.html is still public
[02:39] i have no idea what that means but it looks cool
[02:40] *** balrog wishes btrfs had per-subvolume RAID already
[02:40] Well, we want yellow. Lots of yellow
[02:41] And then turning it back to blue
[02:46] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[03:22] *** Owen-x has quit (Owen-x)
[03:25] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[03:40] *** svchfoo1 has quit (Read error: Operation timed out)
[03:42] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[03:43] *** svchfoo2 gives channel operator status to svchfoo1
[03:50] *** Owen-x has quit (Owen-x)
[03:55] how reliable is an rpi in terms of ram corruption etc
[04:00] I heard that
[04:29] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[04:34] *** bzc6p has quit (Ping timeout: 600 seconds)
[04:34] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[05:06] *** zottelbey has quit (Remote host closed the connection)
[08:09] SketchCow: speed is getting better, peaks at 1.5MB/s now
[09:03] *** Muad-Dib (~paul@[redacted]) has joined #internetarchive.bak
[10:39] *** svchfoo1 has quit (Remote host closed the connection)
[10:40] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[10:43] *** svchfoo2 gives channel operator status to svchfoo1
[10:57] *** csssuf has quit (Ping timeout: 370 seconds)
[10:58] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[12:21] Great
[13:03] *** csssuf has quit (Ping timeout: 370 seconds)
[13:04] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[13:44] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[13:59] *** bpye (~quassel@[redacted]) has joined #internetarchive.bak
[13:59] *** bpye has quit (Remote host closed the connection)
[14:00] *** bpye has quit (Remote host closed the connection)
[14:00] *** bpye (~quassel@[redacted]) has joined #internetarchive.bak
[14:32] and it slowed down again :p 92.0KB/s
[16:30] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[16:35] *** bzc6p_ has quit (Read error: Operation timed out)
[17:14] *** patricko- is now known as patrickod
[17:26] *** patrickod is now known as patricko-
[17:50] How much of it is out there! (Closure had a factoid)
[17:50] The stats run - how long does it take?
[18:04] *** zottelbey has quit (Remote host closed the connection)
[18:06] *** patricko- is now known as patrickod
[18:08] hey so I'd like to set up a project on github for client-side scripts
[18:09] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[18:11] *** bzc6p__ has quit (Read error: Operation timed out)
[18:15] numcopies +0: 72540
[18:15] numcopies +1: 30779
[18:15] numcopies +2: 21
[18:15] numcopies +3: 3
[18:15] these stats are probably out of date.. everyone: git-annex sync
[18:16] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[18:18] *** bzc6p_ has quit (Read error: Operation timed out)
[18:28] closure: do you have a few minutes to look at an error I'm getting when I build git-annex?
[18:29] *** patrickod is now known as patricko-
[18:32] *** patricko- is now known as patrickod
[18:37] *** patrickod is now known as patricko-
[18:49] Hey, so who was it who reported the 1pb of duplicates by md5?
[18:53] *** bzc6p__ is now known as bzc6p
[18:54] sep332: Yo
[18:54] oh hey
[18:55] yeah that was me
[18:56] Is it an actual report or textfile?
[18:57] "by md5" --- ehh...
[18:57] suggestion, also compare filesizes
[18:57] well i have a list of: count, hash, size
[18:57] so i did (count-1) * size to get size of duplicates
[18:58] So, short form.
[18:58] it is of interest to Brewster and IA if there is an assessment showing that there is 1pb of duplicate files.
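The arithmetic sep332 describes above ("i have a list of: count, hash, size" and "(count-1) * size to get size of duplicates") is easy to sketch: for each hash seen `count` times, one copy is the original and the rest are redundant. A minimal illustration in Python with made-up rows (the hashes and sizes are hypothetical, not actual IA data):

```python
# Hypothetical (count, md5, size-in-bytes) rows, in the shape sep332 describes.
rows = [
    (3, "d41d8cd9...", 1_000_000),  # 3 copies -> 2 redundant copies
    (2, "9e107d9d...", 500_000),    # 2 copies -> 1 redundant copy
    (1, "e4d909c2...", 750_000),    # unique file -> no waste
]

# For each hash, (count - 1) copies are redundant; sum their sizes.
duplicate_bytes = sum((count - 1) * size for count, _, size in rows)
print(duplicate_bytes)  # 2500000
```

Summing that quantity over the full file list is what produces the "1pb of duplicates" figure being discussed.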
[18:59] And if it comes in the form of something we can look at.
[18:59] So, if you have a file that can be "here are items that are the same"
[18:59] That will be of use specifically.
[19:00] I'd say a CSV of:
[19:00] size,item1,item2,item....
[19:00] *** patricko- is now known as patrickod
[19:09] balrog: MD5 is still resistant to preimage attacks. checking the size is a good idea though
[19:10] IMHO, I'd do more than md5+size to actually confirm duplicates
[19:10] I'd probably do compare of the entire data, if it was my drive
[19:10] SketchCow: I mostly have lists of individual files, but I can try extracting items if that's more useful
[19:11] I think items are how we deal.
[19:21] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[19:23] *** patrickod is now known as patricko-
[19:43] *** Owen-x has quit (Owen-x)
[19:47] *** csssuf has quit (Ping timeout: 370 seconds)
[19:47] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[21:03] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[21:15] *** patricko- is now known as patrickod
[21:19] *** Owen-x_ (~owen@[redacted]) has joined #internetarchive.bak
[21:21] *** patrickod is now known as patricko-
[21:21] *** Owen-x has quit (Ping timeout: 186 seconds)
[21:21] *** Owen-x_ is now known as Owen-x
[21:22] *** Owen-x has quit (Client Quit)
[22:26] *** swebb has quit (Quit: badcheese.com - where crap sometimes gets done)
[22:31] *** swebb (~swebb@[redacted]) has joined #internetarchive.bak
[22:32] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[22:35] *** patricko- is now known as patrickod
[23:02] *** patrickod is now known as patricko-
[23:02] *** patricko- is now known as patrickod
[23:08] *** Owen-x has quit (Owen-x)
[23:18] *** Owen-x (~owen@[redacted]) has joined #internetarchive.bak
[23:26] *** patrickod is now known as patricko-
[23:32] *** Owen-x has quit (Owen-x)
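The two suggestions in the log, SketchCow's "CSV of: size,item1,item2,..." and the advice to "do compare of the entire data" rather than trust md5+size alone, fit together naturally. Here is a hedged Python sketch of that workflow; the use of sha256 as the full-content check and the function names are my assumptions for illustration, not anything the channel specified:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash the entire file contents (sha256 chosen as an assumption;
    any full-content comparison satisfies the 'compare the entire data' advice)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_csv_lines(paths):
    """Group candidates by (size, full-content hash) and emit
    'size,item1,item2,...' lines for every group with more than one member."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        groups[(p.stat().st_size, content_hash(p))].append(str(p))
    return [
        ",".join([str(size)] + sorted(items))
        for (size, _), items in groups.items()
        if len(items) > 1
    ]
```

In a real run over petabytes you would cheaply pre-filter by size and md5 first and only full-hash the collisions; the sketch skips that optimization for clarity, and it lists file paths where SketchCow ultimately wants archive.org item identifiers.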