[00:12] *** Start has joined #internetarchive.bak
[00:57] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[01:01] *** Lord_Nigh has joined #internetarchive.bak
[01:26] *** Lord_Nigh has quit IRC (Ping timeout: 633 seconds)
[01:49] *** Lord_Nigh has joined #internetarchive.bak
[01:52] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[02:00] *** Lord_Nigh has joined #internetarchive.bak
[02:06] *** Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
[02:09] *** Lord_Nigh has joined #internetarchive.bak
[02:14] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[02:33] *** Lord_Nigh has joined #internetarchive.bak
[03:21] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[03:30] *** Lord_Nigh has joined #internetarchive.bak
[05:09] *** Blackout has joined #internetarchive.bak
[05:35] *** kyan has quit IRC (Quit: Leaving)
[07:21] WHERE
[07:21] ARE
[07:21] THE
[07:21] SHARDMASTERS
[07:22] I need you to work with closure. I need you to start assigning items to shards
[07:22] would itemlists from census be helpful?
[07:22] Somewhat
[07:22] https://archive.org/download/archiveteam_census_2016
[07:22] We need to work on these tomorrow
[07:26] *** vitzli has joined #internetarchive.bak
[07:28] Tomorrow
[07:28] We appointed three shardmasters. I expect to hear from them tomorrow or I will find replacements.
[07:28] The three shardmasters are HCross2, Kaz, and Jess
[07:28] JesseW
[07:29] Tomorrow or I move faster
[07:32] Here I am
[07:34] Did you get credentials from Closure to begin assigning shard sets
[07:35] I haven't
[07:35] We need you to do that.
[07:35] And then, just start working on these. It's file-based, not item-based.
[07:36] Use the Wiki or Google Docs to make them, if you have to.
[07:36] I will contribute all the time needed to suggest collections of higher priority
[07:36] I will also begin talking behind the scenes about how to handle web grabs (likely by making encrypted/password-protected chunks)
[07:39] Will do. I'll go over all the documents now
[07:45] closure: 15 mins from work now. When I get in, I'll send you an SSH key
[07:53] Good.
[07:53] I think I should start a Slack too
[08:05] Didn't want to wait. Slack created.
[08:05] Slacks that are free are always a pain in the ass. I am inviting the shardmasters, closure, and then in the future we will use it to reach out to people who have access to a lot of disk space but just don't deal with IRC as much as Slack
[08:17] *** atomotic has joined #internetarchive.bak
[08:17] So please coordinate with closure when he wakes (I can e-mail him if he's not checking IRC) and we can begin designing shards, and then I will make a call out to a set of people to help back things up
[08:19] But we have 12 petabytes to coordinate and we should get on that hardcore
[08:19] Ugg. Been away; and a number of my bits of shards have gone offline and expired. I'll get them back online over the rest of the week.
[08:19] Please do
[08:19] 12PB is going to take a lot of volunteers
[08:19] I also want us to please create documentation for people to read and re-read as needed to keep track.
[08:19] Perhaps a readthedocs
[08:44] *** ivan has joined #internetarchive.bak
[08:44] *** zhongfu has joined #internetarchive.bak
[10:09] *** kurt has joined #internetarchive.bak
[10:10] Closure idle for 11 days, doesn't look promising
[11:03] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[11:33] *** atomotic has joined #internetarchive.bak
[11:48] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:42] *** atomotic has joined #internetarchive.bak
[12:49] *** VADemon has joined #internetarchive.bak
[14:13] *** VADemon has quit IRC (Read error: Operation timed out)
[14:16] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[14:20] *** Deewiant has joined #internetarchive.bak
[14:31] *** vitzli has quit IRC (Quit: Leaving)
[14:56] *** atomotic has joined #internetarchive.bak
[14:58] *** Atom has joined #internetarchive.bak
[15:27] *** Start has quit IRC (Quit: Disconnected.)
[16:08] We'll deal.
[16:09] Kaz and I'll find JesseW
[16:19] ---------------------------------------------
[16:19] Who in this channel can step forward to help with client coding or configuring?
[16:19] ---------------------------------------------
[16:31] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:40] @SketchCow: I can help out after this weekend. Is there a list of things that need to be done? Not seeing one on the IA.BAK wiki page
[16:42] SketchCow: I could probably help (well, once I finish recovering from the shock of President Trump)
[16:42] What needs to be done?
[16:45] The whole IA.BAK project wasn't mothballed, but it was in "see it running" mode
[16:45] Now it is not.
[16:46] What I want is a group of people willing to step in and talk with people who have disk space, to help them get on the project
[16:47] Brewster and I chatted. He is tacitly fine with this.
[16:52] Ok, so recruiting people in. What is there to do software/infrastructure-wise? I'm not the best "people person"...
[16:52] Ditto
[16:52] I'm a people person.
[16:53] :)
[16:53] I'm a human
[16:53] Kaz. Shardmastering. Need you on it stat.
[16:53] ping me an e-mail address
[16:53] yes
[16:53] but Closure
[16:54] I can reach Closure.
[16:54] okay
[16:54] Ping me an e-mail.
[16:54] I need to start assembling people with disk space. 50tb folks.
[16:54] just need my pubkey?
[16:54] No, I am not doing that. I need your e-mail so I can have you on the Slack
[16:55] iabak"kurtmclester.com
[16:55] bloody keyboard layout
[16:55] iabak@kurtmclester.com
[16:55] Invited
[16:58] SketchCow: do we aim for fewer people with lots of storage or more people with less though? We would need >600 people each with 50TB to back up the 30PB archive just once, so easily over 1500 people to give most items triple redundancy.
[16:58] All with 50TB
[16:58] Several things.
[16:59] First, it's not 30PB
[16:59] It's more like.... 12 public facing, 15 wayback
[16:59] We're going after public facing initially
[16:59] Second, I agree, this is relatively difficult to aim for
[17:00] Luckily, there's material that we can skip over
[17:00] Hence Shardmasters, and not just starting at AAAAAAA.txt (0000000.txt depending on your system) and moving forward
[17:02] Examples of materials we can skip over: duplicates of television shows, spam
[17:04] Ok, fair points. Even with that in mind though, let's assume that there's 8PB of stuff we want to have triple redundancy on. That's still ~500 people with 50TB each. That's more in the realm of plausibility, but my main point is I think it would be more worthwhile to try and get a lot of people with just like a couple 2TB external HDDs or whatever rather than focus on people with huge disk arrays. More likely (IMHO) that we could get the amount of storage needed that way.
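The volunteer-count estimates traded above are easy to sanity-check. A minimal sketch (my own arithmetic, using binary units where 1 PB = 1024 TB; the 8PB-at-3x and 30PB-once figures are the ones quoted in the discussion):

```python
# Rough sanity check of the volunteer counts discussed above.
PB = 1024  # TB per PB (binary units, close enough for back-of-envelope math)

def volunteers_needed(data_pb, copies, tb_per_volunteer):
    """Volunteers required to hold `copies` full replicas of the data."""
    total_tb = data_pb * PB * copies
    return total_tb / tb_per_volunteer

# ~8 PB kept with triple redundancy, volunteers offering 50 TB each:
print(round(volunteers_needed(8, 3, 50)))   # 492, i.e. the "~500 people" above
# The full 30 PB archive, one copy, 50 TB each:
print(round(volunteers_needed(30, 1, 50)))  # 614, i.e. the ">600 people" above
```

Both quoted round numbers hold up, which is why the argument quickly shifts from the totals to which contributor profile is easier to recruit and support.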
[17:04] Yes, but
[17:05] You do realize it's possible to SEEK OUT group A while ALSO SEEKING OUT group B
[17:05] Group A preferred, Group B nice
[17:05] Group B also comes with a lot more support needs
[17:05] Oh no my drive broke, oh no why does it not sync
[17:05] Hence I am trying to build an actual support structure this time.
[17:06] But saying '50tb minimum' may be a useful "you must be this tall" measure - people who are more able to offer a tiny amount of space may also be more likely to churn out, disappear, get lost, etc. Whereas someone standing up 50tb is doing so for a reason and is (hopefully) less likely to disappear on a whim.
[17:06] ... which is something we're going to have to deal with in the long run anyways
[17:06] Yes I just don't see a way to even get close to enough storage if we set the min to like 50TB
[17:06] Please stand over here
[17:06] Next to the group of people who told me my projects seemed unrealistically attainable
[17:08] Look, I'm not saying that the whole thing is unrealistic at all. Just that setting the bar so high I think will diminish the likelihood of it happening.
[17:08] Please prove me wrong though
[17:08] I would love to see it
[17:12] On it
[17:12] Anyways, back to the original question: what needs to be done software-wise?
[17:17] Our client for IA.BAK can use refinement/flexibility for a download page.
[17:17] So the time from "find this" to "install" is as short as the Warrior.
[17:18] Docs writer coming in.
[17:20] *** cmaldonad has joined #internetarchive.bak
[17:20] Hello, cmaldonad
[17:20] <--- Jason
[17:20] hi SketchCow
[17:20] Website: http://iabak.archiveteam.org/
[17:20] ---> kami here
[17:20] reading that
[17:21] Wikipage: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
[17:23] I see an additional concern that is not listed
[17:24] Always up for hearing it
[17:24] Periodical Restore Rehearsal events
[17:24] That's actually in there, but not listed
[17:24] otherwise there's no way to know that your restore plans are useful/adequate
[17:24] ok
[17:24] Sorry, it's in there in a "built into the system" sense
[17:24] I get the idea of the project
[17:24] ok
[17:25] So, we have a client people install, and I'd like to begin a documentation set related to it, to help people ramp up.
[17:25] ok
[17:26] Anything written, put in a public place (Google Docs, or the wiki) to begin to build that framework
[17:26] The people we'
[17:26] it's important to have an idea of the audience
[17:26] The people we're dealing with initially will be comfortable but we quickly move to a situation where people who are more "just what do I type" come in
[17:26] Initially, Unix nerds, then ultimately, it's a client in Windows and other systems that people are plugging removable hard drives into
[17:27] I am a Sys Admin, but I've written docs for entry level staff, so I consider myself useful for either audience (high or low technical skill level)
[17:27] The beginning of a framework based on what's written should work. I can answer questions as can others.
[17:27] If you find gaps or want links, we can help
[17:29] I am reading this http://tracker.archiveteam.org at the moment
[17:30] You got it
[17:30] Tracker is currently a separate project but worth seeing since it came from the same people.
[17:32] and now I jumped to this http://git-annex.branchable.com
[17:41] Can I see an example shard please, so I can get an idea of what to do?
[17:41] I'm also happy if people have servers/space and want me to configure it all
[17:48] "A script can do this using the git annex fromkey and git annex registerurl commands. Time to make such a repository with 100k files is in the 10 minute range (faster on SSD or ramdisk)."
[17:48] example values for this section would help
[17:59] HCross2: an example shard is http://iabak.archiveteam.org/SHARD1.html
[17:59] I meant the actual contents of the shard file
[18:01] it's not a file, it's a git repository
[18:01] to create the repository we first make a list of collections, then use a script to enumerate their contents, adding each item in the collection to the repository
[18:04] https://github.com/ArchiveTeam/IA.BAK/blob/server/mkSHARD
[18:05] at some point I will need to make sure that the ia.bak code will run on FreeBSD (since that's where all my storage is), so I will try to get some time in to look at the code
[18:08] Thanks db48x
[18:09] you're welcome
[18:09] I'll see about writing up a set of instructions on how to create shards
[18:19] we aimed to have about 100,000 files adding up to between 2 and 5 TB in each shard
[18:25] I need a secondary/majordomo/co-organizer for this project.
[18:25] Someone who is also on here a lot and can help answer questions so stuff doesn't linger.
[18:35] I guess I can do that; I've run this code before
[18:37] someone with write access to the wiki, can you add CGI as a Perl dependency?
[18:38] Mostly, I want, as a metric, for valid questions in this channel to be answered in 15 minutes if possible.
[18:38] If it takes this being a big priority, I get it. I just don't want things lingering.
[18:38] For example: Meroje: No.
[18:38] See? I got back to them in 60 seconds.
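The fromkey/registerurl flow quoted above, which mkSHARD automates per file, can be sketched roughly as follows. This is an illustration of the general pattern, not the actual mkSHARD code: the key, path, and URL below are made-up placeholders, and `annex_commands` is a hypothetical helper of mine.

```python
# Sketch of the per-file steps a shard-building script runs with git-annex.
# All values here are hypothetical placeholders, not real Internet Archive data.

def annex_commands(key, path, url):
    """Commands to register one remote file in a git-annex repository:
    `fromkey` creates the annexed file without downloading its content,
    `registerurl` records where that content can later be fetched from."""
    return [
        ["git", "annex", "fromkey", "--force", key, path],
        ["git", "annex", "registerurl", key, url],
    ]

# A git-annex key encodes backend, size, and hash (placeholder hash here):
key = "SHA256E-s1024--" + "0" * 64 + ".txt"
cmds = annex_commands(
    key,
    "example-item/example-file.txt",
    "https://archive.org/download/example-item/example-file.txt",
)
for cmd in cmds:
    print(" ".join(cmd))
```

Repeating these two commands for each of a shard's ~100,000 files is what makes repository creation land "in the 10 minute range" quoted above: no content is downloaded, only keys and URLs are recorded.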
[18:38] great
[18:40] cmaldonad: Please mail me at jason@textfiles.com if you run into issues with the framework
[18:40] My schedule: Broadway show tonight, travel to DC tomorrow, working in warehouses for 3 days, back up
[18:44] Meroje: why don't you have write access to the wiki?
[18:45] SketchCow, will do
[18:46] testing
[18:46] ok, timestamps enabled here
[18:47] Yes.
[18:52] *** dfboyd has joined #internetarchive.bak
[18:58] A contributor with TB is coming on here soon, we can work with him and see how the onboarding is
[19:00] TB ?
[19:02] Terabytes
[19:02] And Tuberculosis
[19:02] A user with both disk space and a debilitating lung disease
[19:03] Johnny Pneumonic
[19:04] In case it comes up: I ran the numbers on Amazon Glacier. It would take 430 Snowball servers to move 21PB; stored in Amazon Glacier it would cost $154,140.67 a month in us-east-1 or their other less-expensive regions.
[19:05] Those numbers were run some time ago
[19:05] But agreed, we found it not workable
[19:05] Even with Glacier
[19:05] dfboyd and cmaldonad - Docs
[19:05] cmaldonad: dfboyd has stepped forward to run second if you need verbiage or research
[19:06] thanks
[19:06] but do we have a list of pending documents to write, or should I make a decision as we determine what is needed as we get new people contributing space and generating questions?
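The Glacier figures quoted above can be reconstructed exactly, assuming (my assumption, matching the 2016 era) Glacier's then-current $0.007/GB-month price and the original 50 TB AWS Snowball capacity:

```python
# Reconstructing the Amazon Glacier estimate quoted above.
# Assumed inputs: $0.007/GB-month (Glacier's 2016-era price) and
# 50 TB per Snowball appliance (the original capacity); both have changed since.
GB_PER_TB = 1024
TB_PER_PB = 1024

data_pb = 21
data_gb = data_pb * TB_PER_PB * GB_PER_TB  # 22,020,096 GB

monthly_cost = data_gb * 0.007
print(f"${monthly_cost:,.2f}/month")  # $154,140.67/month, exactly as quoted

snowballs = data_pb * TB_PER_PB / 50  # 50 TB per Snowball
print(round(snowballs))               # 430 appliances, as quoted
```

That the quote matches to the cent suggests the original estimate used these same binary-unit inputs.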
[19:07] cmaldonad: we're not so organized that we have a list of documents that are yet to be written
[19:08] I guess you could add it to the list of documents to write
[19:08] I can give a few TB
[19:09] I think the priority is "Someone wanders in from the street with a pile of drive space and an existential fear for the archive's data"
[19:17] We could start with a "Jumpstart to your own shard"
[19:17] and there's also something that comes to mind
[19:18] like a matrix that would allow people to know how they can best contribute with whatever space available they have
[19:18] but I need to read more about tech details on that
[19:18] the Jumpstart is a good way to start
[19:22] sounds good
[19:22] ask me questions and I'll answer them
[19:25] *** Kksmkrn has joined #internetarchive.bak
[19:29] *** kyan has joined #internetarchive.bak
[19:33] ok db48x, thanks
[19:33] I am afk to make lunch
[19:37] If you assume the average space volunteer has 1TB, then you need 21,000 of them just to have 1x coverage. You probably want 3x coverage: 60,000 people. Suppose the average contributor is able to drop $500 and get 10 x 1TB hard drives, then great, you only need 6000 people?
[19:38] yea, it's a problem
[19:38] This "I ran the numbers guyz" thing is adorable
[19:38] I'll work on having a cohesive response
[19:39] Which means not just a few dedicated volunteers, it means a mass volunteer thing; you need not just hackers and hobbyist engineers, you need retirees and moms and church groups or whatever?
[19:39] what I really "want" to do is write a nice Windows desktop application, to make adoption easier
[19:39] but "want" and "Windows desktop app" don't really go together
[19:39] As long as you're thinking about it already, I won't keep going on about it. You have fingers, you can do arithmetic.
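The small-contributor estimates above reproduce directly in decimal units (1 PB = 1000 TB, matching the quoted round numbers); the exact figures are 63,000 and 6,300, rounded down in the discussion to 60,000 and 6,000:

```python
# Back-of-envelope check of the 1 TB-per-volunteer estimates quoted above.
# Decimal units (1 PB = 1000 TB), matching the round numbers in the log.
archive_tb = 21 * 1000  # the ~21 PB estimate under discussion

for tb_each, copies in [(1, 1), (1, 3), (10, 3)]:
    people = archive_tb * copies // tb_each
    print(f"{tb_each} TB each at {copies}x coverage: {people:,} people")
```

The output (21,000; 63,000; 6,300 people) shows why the channel keeps circling between "a few hundred 50 TB contributors" and "tens of thousands of 1 TB contributors" as the two recruiting extremes.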
[19:39] this should be as easy as "SETI@home" was
[19:40] A very cohesive response
[19:40] cmaldonad: agreed
[19:40] I know it's not the current status, but that's one of the biggest distributed projects that had success
[19:41] it would be nice too if it were somehow made clear that this isn't a theoretical thing; there's backups out there right now and the point now is to get more
[19:42] uhm
[19:43] dfboyd, is your programming background in Windows or plain C (Unix/Linux variants)?
[19:43] My only other idea that I want to ask about is: suppose the end-user just needs to do the following: 1. download a program of some kind and run it on their PC; they just have to tell it how much storage it's allowed to use. 2. What the program does is, behaves like an HDFS DataNode or a GFS chunkserver: it just checks in to the master and says, "I have XX GB available". 3. The master saves a collection of data blocks to that client; every so oft
[19:43] that's what the current client does
[19:43] I am plain C (Unix/Linux), Python; not Windows-knowledgeable.
[19:43] git-annex is a bit rough on Windows
[19:44] there's path-length-limit issues
[19:44] but it can work
[19:44] However these days one writes cross-platform apps using Electron, which is basically a menu-bar-less browser that runs Javascript apps. That's how the Slack chat client is made. And I do know Clojurescript.
[19:44] dfboyd, then it would be a lot more feasible to have a Raspbian-based image that brings up a shard node by booting a Raspberry Pi and a wizard asking what drive to use
[19:46] Does that mean people have to buy a Raspberry Pi and a hard drive?
[19:46] it would
[19:46] but it would be a zero-config deplouyment
[19:47] deployment*
[19:47] (i.e. one can't just participate by running some background program on one's ordinary desktop PC).
[19:47] I don't know if you are looking for zero config, or wide adoption through reutilization
[19:47] it would be just one option to deploy
[19:48] I just see Windows desktop setups as very fragile
[19:48] say, they would use space probably shared by the Windows installation; most people don't put the OS in a different partition than their data
[19:49] cmaldonad: It helps to understand the nature of git-annex and why I specifically chose that for this
[19:50] fragility can be dealt with; the current system already accounts for that
[19:50] Example: Drives are able to be offlined
[19:50] And verified at times
[19:50] ok
[19:50] (This is why there's an "aging" system already built in: notice how we classify people by last checkins)
[19:50] Idea being someone puts a drive into a bay once a month and it goes whiirrrr and spits them out saying 'thanks'
[19:50] And if it fails, it piles back into the red
[19:51] ok
[19:51] got it
[19:51] also, with git-annex the users aren't downloading random anonymous chunks, they're downloading a random selection of ordinary files that they can just use normally
[19:51] images, music, magazines, whatever
[19:52] in principle they can pick and choose which files they want at any time, if they can use the command line
[19:52] that's a high motivation factor that should be highlighted
[19:52] the hypothetical gui app would make that nicer for most people
[19:56] and then there are the 50GB WARC files that require specialized tools to use, so the HGA won't help much
[19:56] but we're not backing those up yet, so we can just not mention that in the press releases
[20:01] I too require lunch
[20:01] back soon (herbaceous)
[20:09] *** atomotic has joined #internetarchive.bak
[20:11] *** Kksmkrn has quit IRC (Ping timeout: 250 seconds)
[20:11] *** boyd has joined #internetarchive.bak
[20:11] *** dfboyd has quit IRC (Quit: Page closed)
[20:12] *** Kksmkrn has joined #internetarchive.bak
[20:17] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[20:33] *** VADemon has joined #internetarchive.bak
[20:43] *** kyan has quit IRC (Quit: Leaving)
[20:46] HCross2: if it will help you on your quest, i still have collection total sizes and an item list with totalsize/files/collections from jan too (as well as the itemlists i mentioned earlier)
[20:47] i can slice and dice/sort if needed, if not, i will shut up :)
[20:52] *** SketchPho has joined #internetarchive.bak
[20:53] Hey.
[21:01] I've added my phone client to this channel so that I can be more easily reached if needed
[21:01] I'm going to keep out of the other channels
[21:17] Shardmasters, please make the ArchiveBot collection followed by the general Archive Team collection the next shards
[21:18] understood
[21:27] *** Start has joined #internetarchive.bak
[21:30] *** Kksmkrn has quit IRC (Ping timeout: 250 seconds)
[21:31] *** Kksmkrn has joined #internetarchive.bak
[21:32] *** Kksmkrn has left
[23:06] *** cmaldonad has quit IRC (Quit: This computer has gone to sleep)
[23:21] *** Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
[23:25] *** Lord_Nigh has joined #internetarchive.bak
[23:27] *** Lord_Nigh has quit IRC (Excess Flood)
[23:29] *** Lord_Nigh has joined #internetarchive.bak
[23:37] *** bwn has quit IRC (Ping timeout: 244 seconds)
[23:45] *** bwn has joined #internetarchive.bak
[23:58] *** VADemon has quit IRC (Quit: left4dead)