#internetarchive.bak 2015-07-09, Thu


Time Nickname Message
00:08 🔗 CyberJaco is now known as zz_CyberJ
01:42 🔗 ZeroDogg has quit IRC (Read error: Operation timed out)
02:28 🔗 mntasauri has quit IRC (Ping timeout: 369 seconds)
02:38 🔗 mntasauri has joined #internetarchive.bak
02:55 🔗 primus104 has quit IRC (Leaving.)
03:15 🔗 Erkan has quit IRC (Read error: Connection reset by peer)
03:27 🔗 Erkan has joined #internetarchive.bak
06:07 🔗 zz_CyberJ is now known as CyberJaco
06:46 🔗 Zero_Dogg has joined #internetarchive.bak
08:12 🔗 primus104 has joined #internetarchive.bak
10:31 🔗 ohhdemgir has joined #internetarchive.bak
10:40 🔗 atomotic has joined #internetarchive.bak
11:45 🔗 primus104 has quit IRC (Leaving.)
12:03 🔗 mariusz_ has joined #internetarchive.bak
13:16 🔗 CyberJaco is now known as zz_CyberJ
13:36 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
14:07 🔗 zz_CyberJ is now known as CyberJaco
14:22 🔗 primus104 has joined #internetarchive.bak
14:31 🔗 CyberJaco is now known as zz_CyberJ
14:47 🔗 primus104 has quit IRC (Leaving.)
15:32 🔗 atomotic has joined #internetarchive.bak
15:37 🔗 SaltyBob has quit IRC (WeeChat 0.4.2)
16:11 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
16:12 🔗 SketchCow -------------------------------------------------------------------------------------
16:12 🔗 SketchCow OKAY TIME TO GET THE PROJECT BACK ON TRACK
16:12 🔗 SketchCow Closure, what does your time schedule look like, and what are our showstoppers?
16:12 🔗 SketchCow -------------------------------------------------------------------------------------
16:12 🔗 SketchCow Also, do we need more people coding or do we need more people contributing?
16:12 🔗 SketchCow We haven't made up for lost space.
16:16 🔗 Senji_ There've been very few new registrations recently; I think we need more people.
16:16 🔗 Senji_ Also some of those previously-full shards probably need to go on active
16:16 🔗 SketchCow I am all for doing that, but I notice we have a problem with old shards not coming back.
16:17 🔗 Senji_ Well, they're not coming back because no one's downloading
16:17 🔗 SketchCow I feel like we might need some sort of routine that checks old shards, and pulls them back to active.
16:17 🔗 Senji_ (not quite true, I'm still downloading on 2 and 3, but a total of about 60GB a day which is going to take a while :-)
16:19 🔗 Senji_ I don't think that would be an immediate coding problem if someone with commit access were paying attention and moving shards back to active
16:21 🔗 Senji_ But we'll need, *handwave* 8TB or so of new users to fill up those gaps in the old shards.
16:24 🔗 SketchCow I put a call out
16:24 🔗 Zero_Dogg There is a need for something to check for duplicates. You could probably free up a bunch of space by deleting duplicates.
16:24 🔗 SketchCow The system we're using right now removes a lot of duplicates.
16:24 🔗 SketchCow And we don't back up derives.
16:25 🔗 Senji_ From reading the code: if 7 or 8 fill up and people already have one of the older shards, they'll start downloading on it even if it's on maint; but they'll only check it out if it's on active
16:25 🔗 Zero_Dogg aye, but when I use git annex to drop files with 5+ copies, it does drop a lot, freeing up space for other data and shards
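The manual dedup Zero_Dogg describes amounts to roughly the sketch below, assuming git-annex's standard matching options; the copy threshold and path are placeholders rather than the project's actual settings:

    # Drop the local copy of any file that git-annex already counts 5 or more copies of.
    # drop verifies that other copies exist and refuses to go below numcopies.
    git annex drop --copies 5 .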
16:25 🔗 Senji_ Zero_Dogg: Is storage space the constraining issue though?
16:26 🔗 Senji_ Or is it (a) bandwidth (b) people getting bored
16:26 🔗 primus104 has joined #internetarchive.bak
16:27 🔗 Zero_Dogg For me it is, at least. I have ~1.5TB or so. If I can remove dupes (which I'm trying to do by hand by issuing the git annex command) then I can store more of the data that's needed, i.e. more effective use of the people that have not given up, reducing the need for more people (but not eliminating it)
16:27 🔗 Senji_ Over the whole dataset only about 1/6 of files are over-duplicated
16:27 🔗 SketchCow The constraining issue is that I didn't want to fling new people on until I knew we were stable, and closure has been busy (as have I)
16:28 🔗 Senji_ And they do provide extra redundancy when people expire
16:28 🔗 Zero_Dogg 1/6 is quite a lot of data though, in particular when some clients are expiring
16:28 🔗 Zero_Dogg true, getting more people would probably solve it entirely
16:28 🔗 Senji_ Zero_Dogg: I agree it would be good to address the problem; I just don't think that it's the priority for coding attention :)
16:29 🔗 Senji_ Reducing the amount of data that's over-downloaded in the first place might help more for instance
16:29 🔗 Zero_Dogg Senji_: no, probably not - at least not until getting new people becomes a problem
16:29 🔗 Zero_Dogg that it would, but that'd probably require patches to git-annex
16:31 🔗 Senji_ Zero_Dogg: just adding --not --copies 4 to the ANNEX OPTIONS file (whatever it's called) helps a bit
16:32 🔗 Senji_ Because currently it downloads any *directory* that has any insufficiently-duplicated file in it
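In plain git-annex terms, Senji_'s suggested options amount to something like the following; the real IA.BAK client passes them via its per-shard options file, whose exact name and contents aren't shown here:

    # Fetch only files that don't already have 4 or more known copies,
    # rather than everything in a directory that contains one such file.
    git annex get --not --copies 4 .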
16:33 🔗 balrog has quit IRC (Remote host closed the connection)
16:33 🔗 Zero_Dogg Senji_: ah, nod. Would still result in plenty of duplicates though, as it never actually updates the queue after starting to download. For me that meant it blindly grabbed stuff for weeks (due to bandwidth limiting) without checking to see if there were dupes
16:34 🔗 Zero_Dogg some might also be solved by periodically killing off git annex and doing a sync before re-starting download
16:34 🔗 Senji_ Zero_Dogg: it syncs every hour
16:34 🔗 Zero_Dogg nod, but the already running download doesn't care about that
16:34 🔗 Zero_Dogg it already has its queue
16:34 🔗 Senji_ Well, again the extra --not --copies 4 helps
16:35 🔗 Zero_Dogg but will git-annex reload the database and re-check against those, instead of whatever the state was when the download started?
16:36 🔗 Senji_ I don't think git-annex stores the database in memory at all; just the list of things it's been told to download
16:37 🔗 Zero_Dogg could be, depends on when it does the checks then I suppose. I'm not deep enough into the internals to know. If that is the case though, --not --copies 4 would help a great deal.
16:37 🔗 Senji_ So every time it goes to the next directory, it checks to see which things inside it still need to be downloaded
16:38 🔗 Senji_ I'm just reasoning based on observation though
16:38 🔗 Zero_Dogg nod, if git annex does do the checks during runtime on-demand, that'd probably work
16:39 🔗 Zero_Dogg (and work better than deleting duplicates at some other time, as that could get messy with conflicts)
16:39 🔗 Senji_ You still have the problem that with large files it's very easy for lots of people to be downloading the same one at once
16:40 🔗 Senji_ I think shard7 has some *very* large files, but there were >100GB files in shard6; and no one can download those very quickly
16:40 🔗 Zero_Dogg aye, but given that it's all async and clients can't talk to each other, it'd be quite hard to safely remove dupes, at least without writing a whole load of synchronization stuff that would go through the server
16:40 🔗 Senji_ They also rather distort the graphs, because a 100GB file takes up as much space as a 10-byte file :)
16:40 🔗 Zero_Dogg nod
16:41 🔗 Zero_Dogg hehe
16:41 🔗 Senji_ Zero_Dogg: two-step commit -- sync, move the affected files to a temp directory, sync again, check that it's still safe to delete them, then delete
16:41 🔗 Zero_Dogg that could work
16:42 🔗 Zero_Dogg maybe even add some delay between the first and the last operation to account for slow clients; that could work fairly well
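A rough sketch of the two-step drop being discussed, leaning on git-annex re-checking copy counts at drop time rather than literally moving files to a temp directory; the grace period is arbitrary:

    #!/bin/sh
    set -e
    git annex sync                                   # pull fresh location tracking
    git annex find --copies 5 > /tmp/overduplicated  # files that currently look over-duplicated
    sleep 86400                                      # arbitrary one-day delay so slow clients can publish their state
    git annex sync                                   # refresh again just before acting
    # Re-check each candidate; drop still verifies copies elsewhere and respects numcopies.
    xargs -r -d '\n' git annex drop --copies 5 < /tmp/overduplicated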
16:44 🔗 Senji_ I think in the long run over-duplication is probably a minor issue; assuming we can get enough people altogether :)
16:45 🔗 Zero_Dogg aye, given that enough people participate, it probably won't matter. But you do need some new people then :p
16:48 🔗 SketchCow We absolutely need new people.
16:48 🔗 SketchCow I just wanted to be sure we didn't have lingering issues engineering wise.
16:49 🔗 Senji_ I believe closure has fixed the issue that corrupted shard3
16:49 🔗 Senji_ I'm sure there are other engineering issues; and as I say someone could do with putting some of those other shards on active
16:49 🔗 Senji_ There's lots left to do on shard7 yet though and some of 3 and 8
16:49 🔗 Senji_ So new people won't be idle hands
16:49 🔗 SketchCow Maybe chfoo- or yipdw need to assist with some of the dashboarding
17:21 🔗 balrog has joined #internetarchive.bak
18:06 🔗 mariusz_ has quit IRC (WeeChat 1.1)
18:07 🔗 mariusz1 has joined #internetarchive.bak
18:08 🔗 mariusz1 is now known as mariusz_
18:09 🔗 mariusz_ SketchCow: Having more people is good, but what about reaching the people that already took part in this project? Do you have emails of all the people that signed up? Why not send emails to people with instances that expired, or are about to expire?
18:10 🔗 Senji_ The system has email addresses for *nearly* everyone
18:11 🔗 SketchCow Yes.
18:11 🔗 SketchCow People RIGHT at the beginning, I expected them to drift or commit.
18:13 🔗 mariusz_ it may be a good idea to reach out to some of them.
18:13 🔗 SketchCow I'll talk with closure about it
18:28 🔗 mariusz_ also, another problem I see is that when an instance expires it takes a large portion of one shard down with it. Perhaps having people download multiple shards at the same time would make this problem smaller and recovery faster, especially if we really have a lot of duplicates
18:34 🔗 db48x has joined #internetarchive.bak
18:37 🔗 SketchCow Over time, I think that problem solves itself
18:43 🔗 atomotic has joined #internetarchive.bak
18:49 🔗 db48x which problem is that?
18:49 🔗 Senji_ Expiries making shards look very unhappy
18:50 🔗 Senji_ I think
18:50 🔗 db48x ah
18:56 🔗 zz_CyberJ is now known as CyberJaco
19:20 🔗 yipdw dashboarding eh
19:42 🔗 Senji_ I'm starting to run out of disk space...
19:45 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
19:51 🔗 Senji_ I wonder what the cheapest way of getting a pile more disk space is
20:07 🔗 mariusz_ has quit IRC (Read error: Operation timed out)
22:24 🔗 CyberJaco is now known as zz_CyberJ
23:08 🔗 toad2 has quit IRC (Read error: Operation timed out)
23:11 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
23:20 🔗 wp494 has joined #internetarchive.bak
