[00:08] *** CyberJaco is now known as zz_CyberJ
[01:42] *** ZeroDogg has quit IRC (Read error: Operation timed out)
[02:28] *** mntasauri has quit IRC (Ping timeout: 369 seconds)
[02:38] *** mntasauri has joined #internetarchive.bak
[02:55] *** primus104 has quit IRC (Leaving.)
[03:15] *** Erkan has quit IRC (Read error: Connection reset by peer)
[03:27] *** Erkan has joined #internetarchive.bak
[06:07] *** zz_CyberJ is now known as CyberJaco
[06:46] *** Zero_Dogg has joined #internetarchive.bak
[08:12] *** primus104 has joined #internetarchive.bak
[10:31] *** ohhdemgir has joined #internetarchive.bak
[10:40] *** atomotic has joined #internetarchive.bak
[11:45] *** primus104 has quit IRC (Leaving.)
[12:03] *** mariusz_ has joined #internetarchive.bak
[13:16] *** CyberJaco is now known as zz_CyberJ
[13:36] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[14:07] *** zz_CyberJ is now known as CyberJaco
[14:22] *** primus104 has joined #internetarchive.bak
[14:31] *** CyberJaco is now known as zz_CyberJ
[14:47] *** primus104 has quit IRC (Leaving.)
[15:32] *** atomotic has joined #internetarchive.bak
[15:37] *** SaltyBob has quit IRC (WeeChat 0.4.2)
[16:11] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:12] -------------------------------------------------------------------------------------
[16:12] OKAY TIME TO GET THE PROJECT BACK ON TRACK
[16:12] Closure, what does your time schedule look like, and what are our showstoppers?
[16:12] -------------------------------------------------------------------------------------
[16:12] Also, do we need more people coding or do we need more people contributing?
[16:12] We haven't made up for lost space.
[16:16] There've been very few new registrations recently; I think we need more people.
[16:16] Also, some of those previously-full shards probably need to go on active.
[16:16] I am all for doing that, but I notice we have a problem with old shards not coming back.
[16:17] Well, they're not coming back because no one's downloading.
[16:17] I feel like we might need some sort of routine that checks old shards and pulls them back to active.
[16:17] (not quite true, I'm still downloading on 2 and 3, but a total of about 60GB a day, which is going to take a while :-)
[16:19] I don't think that would be an immediate coding problem if someone with commit access was paying attention to move shards back to active.
[16:21] But we'll need, *handwave*, 8TB or so of new users to fill up those gaps in the old shards.
[16:24] I put a call out.
[16:24] There is a need for something to check for duplicates. You could probably free up a bunch of space by deleting duplicates.
[16:24] The system we're using right now removes a lot of duplicates.
[16:24] And we don't back up derives.
[16:25] From reading the code, if 7 or 8 fill up and people already have one of the older shards, then they'll start downloading on it even if it's on maint; but they'll only check it out if it's on active.
[16:25] aye, but when I use git annex to drop files with 5+ copies, it does drop a lot, freeing up space for other data and shards
[16:25] Zero_Dogg: Is storage space the constraining issue, though?
[16:26] Or is it (a) bandwidth, (b) people getting bored?
[16:26] *** primus104 has joined #internetarchive.bak
[16:27] For me it is, at least. I have ~1.5TB or so. If I can remove dupes (which I'm trying to do by hand by issuing the git annex command), then I can store more of the data that's needed, i.e. more effective use of the people who have not given up, reducing the need for more people (but not eliminating it).
[16:27] Over the whole dataset, only about 1/6 of files are over-duplicated.
[16:27] The constraining issue is that I didn't want to fling new people on until I knew we were stable, and closure has been busy (as have I).
[16:28] And they do provide extra redundancy when people expire.
[16:28] 1/6 is quite a lot of data though, in particular when some clients are expiring.
[16:28] true, getting more people would probably solve it entirely
[16:28] Zero_Dogg: I agree it would be good to address the problem; I just don't think that it's the priority for coding attention :)
[16:29] Reducing the amount of data that's over-downloaded in the first place might help more, for instance.
[16:29] Senji_: no, probably not - at least not until getting new people becomes a problem
[16:29] that it would, but that'd probably require patches to git-annex
[16:31] Zero_Dogg: just adding --not --copies 4 to the ANNEX OPTIONS file (whatever it's called) helps a bit
[16:32] Because currently it downloads any *directory* that has any insufficiently-duplicated file in it.
[16:33] *** balrog has quit IRC (Remote host closed the connection)
[16:33] Senji_: ah, nod. Would still result in plenty of duplicates though, as it never actually updates the queue after starting to download. For me that meant it blindly grabbing stuff for weeks (due to bandwidth limiting) without checking to see if there were dupes.
[16:34] some might also be solved by periodically killing off git annex and doing a sync before restarting the download
[16:34] Zero_Dogg: it syncs every hour
[16:34] nod, but the already-running download doesn't care about that
[16:34] it already has its queue
[16:34] Well, again, the extra --not --copies 4 helps.
[16:35] but will git-annex reload the database and re-check against those, instead of whatever the state was when the download started?
[16:36] I don't think git-annex stores the database in memory at all; just the list of things it's been told to download.
[16:37] could be, depends on when it does the checks then, I suppose. I'm not deep enough into the internals to know. If that is the case though, --not --copies 4 would help a great deal.
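A minimal by-hand version of the --not --copies 4 idea discussed above, for anyone who wants to try it outside the client's options file (whose exact name isn't given here); it assumes the shard checkout is the current directory and that 4 known copies is the target. --copies is a standard git-annex matching option and --not negates it, so this fetches only files the location log believes have fewer than 4 copies:

    git annex get --not --copies 4 .   # skip anything already believed to have 4+ copies

Because the match is evaluated against the synced location log, how much it helps in practice depends on how fresh that log is when the get runs, which is exactly the queue-staleness question raised above.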
[16:37] So every time it goes to the next directory, it checks to see which of the things inside it it has to download.
[16:38] I'm just reasoning based on observation, though.
[16:38] nod, if git annex does do the checks during runtime on-demand, that'd probably work
[16:39] (and work better than deleting duplicates at some other time, as that could get messy with conflicts)
[16:39] You still have the problem that for a large file it's very easy for lots of people to be downloading it at once.
[16:40] I think shard7 has some *very* large files, but there were >100GB files in shard6, and no one can download those very quickly.
[16:40] aye, but given that it's all async and clients can't talk to each other, it'd be quite hard to safely remove dupes, at least without writing a whole load of synchronization stuff that would go through the server
[16:40] They also rather distort the graphs, because a 100GB file takes up as much space as a 10-byte file :)
[16:40] nod
[16:41] hehe
[16:41] Zero_Dogg: two-step commit -- sync, move the affected files to a temp directory, sync again, check that it was still safe to have deleted them, then delete.
[16:41] that could work
[16:42] maybe even add some delay between the first and the last operation to account for slow clients; that could work fairly well
[16:44] I think in the long run over-duplication is probably a minor issue, assuming we can get enough people altogether :)
[16:45] aye, given that enough people participate, it probably won't matter. But you do need some new people then :p
[16:48] We absolutely need new people.
[16:48] I just wanted to be sure we didn't have lingering issues engineering-wise.
[16:49] I believe closure has fixed the issue that corrupted shard3.
[16:49] I'm sure there are other engineering issues; and as I say, someone could do with putting some of those other shards on active.
[16:49] There's lots left to do on shard7 yet though, and some of 3 and 8.
[16:49] So new people won't be idle hands.
[16:49] Maybe chfoo- or yipdw need to assist with some of the dashboarding.
[17:21] *** balrog has joined #internetarchive.bak
[18:06] *** mariusz_ has quit IRC (WeeChat 1.1)
[18:07] *** mariusz1 has joined #internetarchive.bak
[18:08] *** mariusz1 is now known as mariusz_
[18:09] SketchCow: Having more people is good, but what about reaching the people who already took part in this project? Do you have emails of all the people that signed up? Why not send emails to people with instances that expired, or are about to expire?
[18:10] The system has email addresses for *nearly* everyone.
[18:11] Yes.
[18:11] People RIGHT at the beginning, I expected them to drift or commit.
[18:13] it may be a good idea to reach out to some of them.
[18:13] I'll talk with closure about it.
[18:28] also, another problem I see is that when some instance expires, it takes down with it a large portion of one shard. Perhaps having people download multiple shards at the same time would make this problem smaller and recovery faster, especially if we really have a lot of duplicates.
[18:34] *** db48x has joined #internetarchive.bak
[18:37] Over time, I think that problem solves itself.
[18:43] *** atomotic has joined #internetarchive.bak
[18:49] which problem is that?
[18:49] Expiries making shards look very unhappy
[18:50] I think
[18:50] ah
[18:56] *** zz_CyberJ is now known as CyberJaco
[19:20] dashboarding eh
[19:42] I'm starting to run out of disk space...
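A rough by-hand sketch of the conservative dedup pass discussed at 16:41-16:42 above, simplified: instead of moving files to a temp directory and verifying before deleting, it leans on git-annex's own copy checks. It assumes the shard checkout is the current directory, that 6 surviving copies is an acceptable margin, and that drop can satisfy its copy check from the synced location log:

    git annex sync                  # refresh the location log so copy counts are current
    sleep 3600                      # optional grace period for slow clients, as suggested above
    git annex drop --copies 6 .     # drop only content the log says already has 6+ copies
    git annex sync                  # publish the drops so other clients see the freed space

The copy threshold and the sleep are illustrative; drop will in any case refuse to go below the repository's configured numcopies.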
[19:45] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[19:51] I wonder what the cheapest way of getting a pile more disk space is.
[20:07] *** mariusz_ has quit IRC (Read error: Operation timed out)
[22:24] *** CyberJaco is now known as zz_CyberJ
[23:08] *** toad2 has quit IRC (Read error: Operation timed out)
[23:11] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[23:20] *** wp494 has joined #internetarchive.bak