#internetarchive.bak 2015-03-05,Thu

Time Nickname Message
00:09 🔗 bpye_ (~quassel@[redacted]) has joined #internetarchive.bak
00:09 🔗 bpye_ is now known as bpye
00:14 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
00:25 🔗 slagfart (cb2352ae@[redacted]) has joined #internetarchive.bak
00:25 🔗 slagfart has quit (Client Quit)
00:26 🔗 Slagfart (webchat@[redacted]) has joined #internetarchive.bak
00:27 🔗 Slagfart anyone solved it yet?
00:33 🔗 edward_ has quit (Ping timeout: 512 seconds)
00:33 🔗 db48x Slagfart: solved what, exactly?
00:37 🔗 fenn goldbach's conjecture, duh...
00:43 🔗 xmc a way to prevent people from cheating on actually storing data would be to have the checks be for a random byte-range of the file, rather than the whole thing
00:44 🔗 jbenet xmc: look into proofs of retrievability
00:44 🔗 xmc i haven't, maybe later
00:44 🔗 jbenet my personal favorite: https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf
00:45 🔗 jbenet but also, a much simpler merkle-tree (actual merkle tree) proof-of-storage will work just fine.
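
A minimal sketch of the Merkle-tree proof-of-storage jbenet mentions, assuming SHA-256 and fixed 1 MiB chunks; the chunk size, function names, and challenge flow are illustrative, not any specific protocol. The verifier keeps only the root, picks a random chunk index, and the prover answers with that chunk plus its sibling path:

    # Sketch: Merkle-tree proof-of-storage (illustrative, not a real protocol)
    import hashlib

    CHUNK = 1 << 20  # 1 MiB

    def sha256(b: bytes) -> bytes:
        return hashlib.sha256(b).digest()

    def chunk_hashes(path: str) -> list[bytes]:
        out = []
        with open(path, 'rb') as f:
            while (c := f.read(CHUNK)):
                out.append(sha256(c))
        return out

    def merkle_root(leaves: list[bytes]) -> bytes:
        level = leaves[:]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])        # duplicate odd node
            level = [sha256(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    def merkle_path(leaves: list[bytes], i: int) -> list[bytes]:
        path, level = [], leaves[:]            # sibling hashes, leaf to root
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            path.append(level[i ^ 1])
            level = [sha256(level[j] + level[j + 1])
                     for j in range(0, len(level), 2)]
            i //= 2
        return path

    def verify(root: bytes, leaf: bytes, i: int, path: list[bytes]) -> bool:
        h = leaf
        for sib in path:
            h = sha256(h + sib) if i % 2 == 0 else sha256(sib + h)
            i //= 2
        return h == root

Each challenge is O(log n) to check against the stored root; as closure points out just below, the catch is that a cheater can fetch exactly the challenged chunk from the IA on demand.
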
00:45 🔗 knytt (~knytt@[redacted]) has joined #internetarchive.bak
00:45 🔗 knytt I am drunk and I want to know everything that has happened with this so far
00:45 🔗 knytt spare no details
00:46 🔗 knytt for I am smart and can digest information
00:46 🔗 garyrh knytt, look at http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK
00:47 🔗 knytt garyrh, you are a gentleman and a scholar
00:50 🔗 closure xmc: the problem with that method is that if someone wants to be an asshat, they do this: 1. pretend to have a lot of files. 2. find the urls to download those files from the IA. 3. when asked for a proof, download just the byte range needed to generate the "proof"
00:50 🔗 closure or more simply, just actually store files, but if IA ever asks for them back, demand $$$
00:58 🔗 garyrh yeah, I think the only way to reduce the chance of that happening would be some kind of trust system
00:58 🔗 garyrh like registered accounts are more trustworthy than anonymous, etc.
00:58 🔗 jbenet you can do it with crypto, but it gets hard.
00:59 🔗 jbenet thats what the proofs-of-retrievability are for-- they extract the file with each valid proof.
00:59 🔗 jbenet it's slow, but it works in the presence of an adversary that won't actually transmit the file back
00:59 🔗 jbenet -- also, on outsourcing storage, http://cs.umd.edu/~amiller/nonoutsourceable.pdf can be adapted to prevent that
01:00 🔗 jbenet it's a very cool idea: you force them to do proofs with their private key (thats used for some reward/membership) as fast as they can, and based on sequentially determined indices into the file, so they are _forced_ to keep the entire file locally.
01:01 🔗 garyrh ah, that's interesting.
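
A rough sketch of that sequential-index idea: each challenge index depends on the prover's private key and on the chunk just read, so the sequence can't be precomputed or outsourced without shipping the key. Permacoin uses real signatures and tunes the round count against network latency; the keyed hash below is only a stand-in.

    # Illustrative sequential proof; hashing with `privkey` stands in for
    # the signature scheme used in the Permacoin construction linked above.
    import hashlib

    def prove(privkey: bytes, chunks: list[bytes], rounds: int, seed: bytes):
        idx = int.from_bytes(hashlib.sha256(privkey + seed).digest(), 'big') % len(chunks)
        transcript = []
        for _ in range(rounds):
            c = chunks[idx]                 # must be local: it decides the next index
            tag = hashlib.sha256(privkey + c).digest()
            transcript.append((idx, tag))
            idx = int.from_bytes(tag, 'big') % len(chunks)
        return transcript

Because fetching a remote chunk per round would blow the time budget, a rational prover keeps the whole file on disk.
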
01:01 🔗 Peter_ (~Peter1231@[redacted]) has joined #internetarchive.bak
01:01 🔗 jbenet i think https://www.cs.umd.edu/~elaine/docs/permacoin.pdf has the construction
01:01 🔗 Peter_ hey
01:02 🔗 jbenet (we're building something similar too-- http://filecoin.io/filecoin.pdf -- and we may borrow nonoutsourceability from permacoin down the road)
01:02 🔗 jbenet garyrh, xmc, we're pretty serious about supporting you guys in this endeavor. it matters a lot to us
01:03 🔗 jbenet closure pointed me to the proposal he wrote-- i'll try do the same
01:04 🔗 fenn an adversary who is holding the file ransom wouldn't continue responding to large numbers of proofs of retrievability requests
01:05 🔗 Peter_ has quit (Peter_)
01:06 🔗 jbenet fenn: yeah that's accounted for in these protocols. that counts as malicious behavior that is not rewarded (with trust, monetary rewards, whatever)
01:06 🔗 jbenet you can ride a line of "being too computationally busy to respond" and "limiting responses"
01:06 🔗 closure jbenet: I suggest your proposal be shorter -- I suspect ipfs will also allow it to be more elegant ;)
01:06 🔗 jbenet but the point is you can get the file out eventually, which is useful.
01:07 🔗 closure problem is, an actor can behave entirely trustworthily, until the IA burns down, and then switch to ransom mode.
01:08 🔗 jbenet (also for retrieving the file through the proofs, the SW scheme is worse, because it's compact. it's better to have one of the more bandwidth intensive ones)
01:09 🔗 fenn i'm not sure this is really a problem. the ransomer would have to own all backup copies of the file
01:09 🔗 jbenet closure: yeah which is why often these protocols from academia are cast as "cloud storage from company X" who has trust to lose. and the blockchain protocols cast in terms of proofs they must do to win money
01:09 🔗 closure if you have an ongoing incentive system, I can see those proofs working, but in our case, there's not currently much incentive aside from nice/easy, and at the point the data needs to be retrieved, the situation has changed
01:09 🔗 jbenet closure: yep. agreed
01:10 🔗 closure all I can see to do about this is try to avoid situations where only bad actors have a given file
01:11 🔗 jbenet or have an incentive system that is independent of the data
01:11 🔗 jbenet (eg. the cryptocurrency approaches)
01:12 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
01:12 🔗 jbenet (pre-encrypting the data helps. though in the archive's case.... who else would store that much data? {archive, google, gov agencies})
01:13 🔗 Slagfart why differentiate between the retrieval process itself, and proof of ownership?
01:13 🔗 Slagfart wouldn't seeding valid data to other users effectively be the same as non-ransomability?
01:14 🔗 closure no, because it's easy to automate detecting "fire at IA, all data lost" and switching strategies to defection
01:14 🔗 Slagfart say you've got a torrent swarm going, and IA burns down. IA should simply reconnect as a peer, and do it quietly.
01:15 🔗 Slagfart really? is it? would require massive collusion
01:15 🔗 closure I suppose it works for lesser disasters, like the wrong set of drives all dying
01:16 🔗 zx (sid17829@[redacted]) has joined #internetarchive.bak
01:17 🔗 Slagfart if anyone can join and contribute, you can inherently make an assumption that a % of users are honest. I don't think a ransomer is going to bother, because their payoff is so uncertain
01:18 🔗 jbenet so, unless you're doing constant non-outsourceability proofs, all the nodes could pool their storage and store 1 replica. all non-honest nodes (i.e. epsilon rational nodes) would.
01:18 🔗 Slagfart let's assume 50% of users are ransomers, and 50% of the remainder are genuine but are unreliable. you've still only got the need for 4 copies out there to cover for that
01:19 🔗 jbenet so the total storage % that the honest nodes control matters a lot without NOSP (non-outsourceable storage proofs)
01:19 🔗 Slagfart doesn't it only matter in relation to the total pool?
01:20 🔗 jbenet right, it's the % of the total addressable storage
01:20 🔗 Slagfart if it's 10% honest nodes, but the pool capacity is 2000% of what's required, you've still got two points of failure
01:20 🔗 jbenet cause the honest nodes better distribute multiple replicas between them
01:21 🔗 Slagfart if the pool capacity doubles (hard drives and bandwidth double in size overnight), you only need 5% honest nodes
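
Written out, the replica arithmetic goes like this: if a fraction p of holders is both honest and reliable, the chance that none of k independent copies survives is (1-p)^k, so the copies needed for a target survival probability follow directly. With the figures above (p = 0.25), four copies give only about 68%; the numbers are the discussion's guesses, not measurements.

    import math

    def copies_needed(p_good: float, target: float) -> int:
        # smallest k with 1 - (1 - p_good)**k >= target
        return math.ceil(math.log(1 - target) / math.log(1 - p_good))

    p = 0.25                         # 50% ransomers, half the rest unreliable
    print(1 - (1 - p) ** 4)          # ~0.68 survival with 4 copies
    print(copies_needed(p, 0.99))    # 17 copies for 99% survival
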
01:21 🔗 Slagfart that honest node problem has already been solved using bittorrent, via hash trees
01:21 🔗 jbenet yeah, also in the pool world, sub-nodes would defect, so the ransom total is not infinity
01:22 🔗 jbenet (or i think-- i havent mathed)
01:22 🔗 Slagfart my understanding is, you can't fake large datasets against an arbitrary hash if the hash is modern enough (eg SHA2) without computing resources that would drastically exceed the potential payout
01:24 🔗 jbenet that depends entirely on "the potential payout"-- how much are people willing to pay for the wealth of humanity's knowledge?
01:24 🔗 Slagfart i reckon just torrent it imho guys hey
01:24 🔗 Slagfart jbenet - I think it's very low! it's already available for free
01:24 🔗 jbenet not if it's the only copy left
01:24 🔗 jbenet the value increases dramatically
01:25 🔗 Slagfart also let's not kid ourselves - this isn't the wealth of humanity's knowledge, this is largely old Geocities sites
01:25 🔗 jbenet anyway i tend to agree- simple seeding will likely work in practice (hence why ipfs doesnt include any proofs-of-retrievability)
01:25 🔗 jbenet hahahaha
01:25 🔗 garyrh heh not really.
01:25 🔗 Slagfart I agree jbenet, but if you create this, and you see 10 seeds on the torrent swarm, I think you could rest easy
01:26 🔗 Slagfart :)
01:26 🔗 jbenet ok, i need to change locations -- archiveteam, will post a proposal for you soon, would be great to work together on this, you too closure :) <3
01:29 🔗 Slagfart has anyone worked out how to zip up 20 petabytes?
01:30 🔗 garyrh Very slowly.
01:31 🔗 Slagfart you've got a method to keep track of what's in each zip?
01:31 🔗 closure Slagfart: I think you'll find you get a torrent file of some truly amazing size (like 1 terabyte) and/or your bittorrent client runs out of ram and/or each chunk is so many gigabytes in size that it turns out to be nearly impossible to complete any chunk
01:31 🔗 godane has quit (Quit: Leaving.)
01:32 🔗 Slagfart oh I agree - but the main page already has the proposal to split it up into 42k 500GB chunks
01:32 🔗 Slagfart each one would be a different torrent. the piratebay for example hosts much more than this, and any given torrent can be reliably downloaded :)
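
The chunk count being tossed around is straightforward division (sizes as discussed in the channel):

    total = 21 * 10**15      # ~21 PB, the figure used in this discussion
    chunk = 500 * 10**9      # 500 GB per torrent
    print(total // chunk)    # 42000 torrents
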
01:33 🔗 Slagfart there's literally a financial disincentive to seed on the piratebay, but people keep doing it. I think the altruism will be a big factor - why do you guys get donations every month?
01:38 🔗 Lord_Nigh (~Lord_Nigh@[redacted]) has joined #internetarchive.bak
01:38 🔗 Lord_Nigh hi all
01:39 🔗 closure Slagfart: I think that could work, and it has a virtue of simplicity (aside from the whole bad actor issue)
01:40 🔗 knytt has quit (Quit: Leaving)
01:41 🔗 Slagfart I'm changing locations too - http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork
01:41 🔗 Slagfart :)
01:41 🔗 Slagfart will be back later - interesting discussion! cheers all
01:42 🔗 closure Slagfart: here, I've written it up http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/torrents_implementation
01:44 🔗 Lord_Nigh is prometheus from ipfs in here?
01:44 🔗 Lord_Nigh https://news.ycombinator.com/item?id=9148576
01:45 🔗 Lord_Nigh as for integrity validation of 500gb blocks (assuming that's how the system ends up getting done) wouldn't it make the most sense to pack a block into 475gb, pack on 25gb of parity/ecc data, and sign the whole damn thing with an IA master key?
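
One way to read that layout, sketched with stand-ins: 19 data stripes plus one XOR parity stripe is exactly the 475:25 ratio, and XOR parity can rebuild any single lost stripe. A real build would more likely use Reed-Solomon/PAR2 recovery data; the ed25519 signing via PyNaCl is a real API, everything else here is illustrative.

    # Sketch of a sealed 500 GB block: 19 data stripes + 1 parity stripe,
    # with a signed manifest of stripe hashes (the "IA master key" assumed).
    import hashlib
    from nacl.signing import SigningKey   # PyNaCl

    N_STRIPES = 19                        # 475 GB data : 25 GB parity

    def xor_parity(stripes: list[bytes]) -> bytes:
        p = bytearray(len(stripes[0]))
        for s in stripes:
            for i, b in enumerate(s):
                p[i] ^= b
        return bytes(p)

    def seal_block(data: bytes, master_key: SigningKey):
        stripe = -(-len(data) // N_STRIPES)              # ceil division
        stripes = [data[i * stripe:(i + 1) * stripe].ljust(stripe, b'\0')
                   for i in range(N_STRIPES)]
        parity = xor_parity(stripes)
        manifest = b''.join(hashlib.sha256(s).digest()
                            for s in stripes + [parity])
        return stripes, parity, master_key.sign(manifest)

A downloader re-hashes every stripe, checks the manifest signature with the archive's public key, and XORs the remaining stripes to rebuild one bad or missing stripe.
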
01:45 🔗 garyrh I think that's jbenet.
01:46 🔗 Slagfart has quit (Ping timeout: 240 seconds)
01:51 🔗 jbenet Lord_Nigh yeah that's me
01:54 🔗 closure oh, top of HN right now, I see
01:54 🔗 closure I've done some proposal gardening on the wiki page
01:58 🔗 closure hmm, the geocities torrent was 1 tb, and it strained some torrent stuff and had trouble getting well seeded.
02:00 🔗 mike` (~mike@[redacted]) has joined #internetarchive.bak
02:05 🔗 closure the geocities torrent currently has 1 leech and 0 seeds :(
02:05 🔗 closure and that's the 641 gb fixed version
02:09 🔗 fenn what specifically went wrong with the 1TB version?
02:21 🔗 fenn d'eaux https://thepiratebay.se/torrent/6350414/Geocities_-_The_PATCHED_Torrent 404's
02:21 🔗 closure https://thepiratebay.se/torrent/6353395/Geocities_-_The_PATCHED_Torrent
02:37 🔗 S[h]O[r]T piece size was too big
02:50 🔗 Control-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
02:56 🔗 Ctrl-S has quit (Read error: Connection reset by peer)
02:56 🔗 Control-S is now known as Ctrl-S
02:56 🔗 godane (~slacker@[redacted]) has joined #internetarchive.bak
02:57 🔗 svchfoo2 gives channel operator status to godane
04:37 🔗 closure well, the torrents idea is looking a little less likely.. how do all those TB of torrents ever get seeded to start?
04:40 🔗 ivan` uTorrent and others support webseeds, which grab over HTTP
04:41 🔗 closure yes, but then you need a file, up for http
04:41 🔗 db48x which IA already does
04:41 🔗 closure maybe the IA could seed some fraction of all the files at a time, not all of them. Would need additional PB of storage
04:42 🔗 closure not in 500 gb collections of items, it doesn't
04:42 🔗 closure see page
04:43 🔗 db48x oh, yea
04:44 🔗 db48x where you split the archive up into uniform-sized chunks, each with a torrent
04:44 🔗 closure right
04:44 🔗 db48x I don't see why that'd be necessary though; IA already has a torrent for every item
04:44 🔗 closure which are not exactly all getting lots of seeds
04:45 🔗 db48x https://ia700800.us.archive.org/17/items/ZztByEpicMegagames/ZztByEpicMegagames_archive.torrent
04:45 🔗 db48x yes, thus my suggestion of a custom BitTorrent client
04:45 🔗 closure it's hard to get 28 million torrents seeded I think
04:45 🔗 xmc closure: hmmm, true.
04:45 🔗 db48x they don't have seeds because there are too many to join any fraction of them
04:46 🔗 db48x most users would have to manually click on a thousand torrent links
04:46 🔗 closure well, if there are 10 thousand users, and you want 10 copies of every file, that bittorrent client would need to load up 28 thousand torrents. Is that doable? I know some people run a lot of torrents, but..
04:47 🔗 db48x if instead they could download a client and tell it that they'd like it to use 100GB of space, and that they like jazz, then it could go join a bunch of torrent swarms automatically
04:47 🔗 closure s/bunch/metric fuckton/
04:47 🔗 db48x that's a point, yes
04:49 🔗 db48x I hadn't considered the constant overhead of the swarm talking to itself; it's a good point
04:49 🔗 closure assume each torrent takes I dunno, 100 kb of ram. That's 3 gb of ram used by the torrent client for 28k torrents
04:49 🔗 db48x it probably dies out once the swarm stabilizes and has no peers, then picks up again when there is a new peer
04:50 🔗 db48x 3gb of address space; it could be swapped out
04:50 🔗 db48x it won't matter if it takes the client a few ms to swap it back and to answer a query about what pieces it has
04:50 🔗 closure you have to track chunks, peers, etc, etc, I don't have numbers, but 100 kb ram seems ballpark
04:51 🔗 db48x agreed
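
closure's ballpark, written out (every input is a guess from the discussion):

    torrents = 28_000_000          # one torrent per item, figure used above
    copies   = 10
    users    = 10_000
    per_user = torrents * copies // users
    ram      = per_user * 100 * 1024        # ~100 KB of client state each
    print(per_user, ram / 2**30)            # 28000 torrents, ~2.7 GiB of RAM
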
04:52 🔗 closure also the whole tracker side.. anyone remember how many items are in TPB?
04:52 🔗 db48x probably varies on chunk size and the number of chunks, which is quite variable for IA items
04:53 🔗 db48x https://news.ycombinator.com/item?id=9149262
04:53 🔗 db48x https://torrentfreak.com/download-a-copy-of-the-pirate-bay-its-only-90-mb-120209/ says 1.6 million in 2012
04:54 🔗 closure that comment is wrong, you still need trackers even if using magnet links
04:55 🔗 db48x yea
04:55 🔗 db48x just a quick source of numbers
04:56 🔗 closure yeah, sounds like there might be tracker software that can handle XX million torrents
04:56 🔗 closure pretty amazing
04:57 🔗 db48x yea, impressive
04:57 🔗 db48x and memory gets cheaper all the time :)
04:57 🔗 db48x btw, use ~~~~ on the wiki to insert a signature
04:57 🔗 closure I suppose trackers could shard pretty well among machines
04:58 🔗 WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
05:07 🔗 cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
05:09 🔗 cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
05:09 🔗 cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
05:42 🔗 X-Scale has quit (Ping timeout: 240 seconds)
06:00 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
06:05 🔗 dyln (~user@[redacted]) has joined #internetarchive.bak
06:06 🔗 bzc6p has quit (Ping timeout: 601 seconds)
06:25 🔗 SketchCow Damn, things got crazy
06:26 🔗 SketchCow So.
06:26 🔗 SketchCow 1. I really want to go forward with Closure's solution. It's by far my favorite.
06:27 🔗 SketchCow 2. I am not against the other two proposals. If they want to initiate tests and the same methods of incrementally doing tests, go for it. Having families and methods is nice.
06:27 🔗 SketchCow 3. That said, I think torrents is insane
06:27 🔗 db48x the whole idea is insane :P
06:31 🔗 SketchCow Yes
06:31 🔗 SketchCow I also think ipfs is interesting, but different than this.
06:32 🔗 SketchCow (Now that I read it)
06:32 🔗 db48x but torrents is probably the insaner of the two
06:32 🔗 SketchCow So, I will push more for closure.
06:32 🔗 SketchCow Of the three, there's three
06:32 🔗 db48x of the N
06:32 🔗 SketchCow So, I sent our internal guy on the Census.
06:33 🔗 SketchCow 14,926,080 items in the database.
06:33 🔗 SketchCow (Not darked, not in some way private, not in some way weird infrastructure in the database)
06:33 🔗 SketchCow Next, he's building a full mined list of these items.
06:33 🔗 SketchCow I mentioned this to Brewster
06:34 🔗 SketchCow Brewster suggests Prelinger be one of the collections.
06:35 🔗 DFJustin I think having 'photogenic' collections at least at first will probably help volunteer uptake / virality
06:35 🔗 SketchCow Yes
07:13 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
07:18 🔗 dyln has quit (Read error: Operation timed out)
07:50 🔗 pg (webchat@[redacted]) has joined #internetarchive.bak
07:51 🔗 pg has quit (Client Quit)
08:16 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
08:31 🔗 edward_ https://news.ycombinator.com/item?id=9147719
08:37 🔗 xmc you
08:37 🔗 xmc so it is you we have to blame for being paid attention to
08:44 🔗 Sanqui that's kind of early attention
09:02 🔗 inversech (~smuxi@[redacted]) has joined #internetarchive.bak
09:07 🔗 midas (~midas@[redacted]) has joined #internetarchive.bak
09:13 🔗 yipdw speaking of being photogenic, since I'm already poking around IA's statusboard, might as well also look into how tied it is to showing book scans
09:14 🔗 yipdw if there's a way to adapt its presentation style to show any item preview that might be fun
09:19 🔗 hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
09:20 🔗 hatseflat hi everyone
09:47 🔗 aschmitz has quit (Ping timeout: 265 seconds)
09:56 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
09:58 🔗 aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
10:58 🔗 Sanqui yipdw: it would be really cool if you could show thumbnails of the images, snippets of the text, samples of the audio, etc. inside the archives
11:09 🔗 Cameron_D (~cameron@[redacted]) has joined #internetarchive.bak
11:25 🔗 zx (sid17829@[redacted]) has left #internetarchive.bak
13:01 🔗 godane has quit (Quit: Leaving.)
13:13 🔗 trs80 (trs80@[redacted]) has joined #internetarchive.bak
13:14 🔗 svchfoo1 gives channel operator status to trs80
13:14 🔗 svchfoo2 gives channel operator status to trs80
13:15 🔗 nicoo (~nico@[redacted]) has joined #internetarchive.bak
13:26 🔗 godane (~slacker@[redacted]) has joined #internetarchive.bak
13:27 🔗 svchfoo2 gives channel operator status to godane
13:36 🔗 ams (webchat@[redacted]) has joined #internetarchive.bak
13:43 🔗 phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
13:52 🔗 cf_ Hi all. Just want to toss out my 2 cents: 1) I don't think the pure torrents solution will work - there are just too many intermediary steps to repack the data and too much to keep track of afterwards. It's possible and the easiest solution upfront, but in the long run I don't think it'll work out too well.
13:52 🔗 cf_ 2) IPFS is really neat, but just isn't ready yet. I've been running some tests on a VPS with one of our twitch megawarcs and it's just not ready for the level of use that we would be forcing on it. I also haven't seen a way to easily determine exactly which other nodes should receive a given file - i.e. it seems like you just push the file out and the network decides who gets copies of it.
13:52 🔗 cf_ I say this only because if we're going to be pushing an extraordinary amount of pretty important data on to the network (and I think 21PB qualifies for this), I feel like we need to be able to control it so that it only goes onto machines owned by people who are willing to keep up a certain level of uptime, throughput, etc.
13:52 🔗 cf_ 3) that leaves us with git-annex, which I think is the best solution. The dev is part of the team, the design proposal makes it look dead simple to use and implement a solution with, and the only issue that I see is the file limit, but (as discussed) that's easily fixed with shards. Anyways, that's just my POV on this.
13:54 🔗 ams has quit (Ping timeout: 240 seconds)
14:00 🔗 VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
14:03 🔗 raylee (~raylee@[redacted]) has joined #internetarchive.bak
14:18 🔗 closure SketchCow: it's funny, torrents is my favorite solution, if seeding can be solved :)
14:19 🔗 closure one seeding solution is to just start with 500 gb torrent 1, get it seeded to enough people that we trust it will live, delete it from our server, and move on to 500 gb torrent 2
14:20 🔗 closure and then if torrent 1 becomes unhealthy, we either a) recreate it from IA Items and get it re-seeded, or b) find we cannot do that anymore (eg, some Items went dark or were modified) and stop offering torrent 1, instead offering torrent 1B which contains all the items that were in torrent 1
14:20 🔗 closure seems pretty doable
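
A sketch of that rotation as a loop; tracker, ia, and the health threshold are stand-ins for whatever scrape/trust checks would actually be used:

    import time

    MIN_SEEDS = 10   # "enough people we trust it will live" -- a guess

    def seed_in_rotation(chunks, tracker, ia):
        for chunk in chunks:                       # 500 GB torrent 1, 2, ...
            ia.seed(chunk)
            while tracker.seed_count(chunk) < MIN_SEEDS:
                time.sleep(600)                    # poll the tracker
            ia.delete_local(chunk)                 # free space, move on

    def repair(chunk, tracker, ia):
        if tracker.seed_count(chunk) >= MIN_SEEDS:
            return                                 # still healthy
        if ia.items_unchanged(chunk):
            ia.seed(chunk)                         # (a) recreate from Items, re-seed
        else:
            ia.publish(ia.recut(chunk))            # (b) retire it, offer "1B"
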
14:21 🔗 ersi cf_: Pro-tip: This is IRC, if you paste long text - it's going to get truncated
14:21 🔗 Sanqui not a fan of the torrent solution
14:21 🔗 ersi I'm a huge fan
14:22 🔗 ersi blows some hot air around
14:22 🔗 closure it's easy, and I don't have to do any of the work >> fan
14:22 🔗 Sanqui it would be cool if distribution was.. distributed a la torrents
14:22 🔗 Sanqui but torrents are unfriendly to splitting and cold storage
14:22 🔗 closure this is true
14:23 🔗 closure git-annex wins there
14:23 🔗 Sanqui would it be possible for git-annex to be peer to peer?
14:24 🔗 closure if the peers have some way of being introduced and communicating, yes
14:25 🔗 Sanqui that's the only advantage I saw in torrents (besides simplicity of implementation)
14:26 🔗 arkiver kut
14:26 🔗 arkiver oops
14:33 🔗 closure is adding ipfs support to git-annex this morning, BTW
14:33 🔗 closure :)
14:54 🔗 phuzion What about Tahoe-LAFS?
14:55 🔗 phuzion I dunno how well it would handle 21PB, but if it handles it well, I think it could certainly be a contender for a storage solution.
15:14 🔗 csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
15:26 🔗 Start has quit (Disconnected.)
16:02 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
16:09 🔗 Start has quit (Read error: Connection reset by peer)
16:09 🔗 Start_ (~Start@[redacted]) has joined #internetarchive.bak
16:20 🔗 enkiv2 (~john@[redacted]) has joined #internetarchive.bak
16:24 🔗 nicoo has quit (Ping timeout: 260 seconds)
16:52 🔗 Start_ has quit (Disconnected.)
17:01 🔗 nicoo (~nico@[redacted]) has joined #internetarchive.bak
17:10 🔗 wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
17:32 🔗 midas lol arkiver :p
17:32 🔗 arkiver :/
17:32 🔗 midas the kut moment
17:32 🔗 arkiver was working on something, didn't work, typed in wrong chat :P
17:33 🔗 midas :p
17:39 🔗 everdred (~irssi@[redacted]) has left #internetarchive.bak
17:49 🔗 WubTheCap has quit (Quit: Restart)
17:50 🔗 WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
17:53 🔗 ersi kut
18:01 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:13 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
18:20 🔗 destrudo (~destrudo@[redacted]) has joined #internetarchive.bak
18:20 🔗 shabble (~shabble@[redacted]) has joined #internetarchive.bak
18:21 🔗 mianaai (~user@[redacted]) has joined #internetarchive.bak
18:21 🔗 mianaai hi
18:25 🔗 ersi Hi.
18:27 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
18:31 🔗 bzc6p_ has quit (Read error: Operation timed out)
18:42 🔗 Start has quit (Disconnected.)
18:57 🔗 closure is now known as joeyh
18:57 🔗 bzc6p__ is now known as bzc6p
19:22 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
19:53 🔗 VADemon has quit (Read error: Connection reset by peer)
20:10 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:12 🔗 X-Scale has quit (Ping timeout: 240 seconds)
20:14 🔗 jbenet SketchCow, cf_, joeyh: yeah ipfs is not perfectly ready yet, but please note we're moving really fast. 6mo ago nothing worked. i think we can get the reliability you need reasonably quickly-- and we are eager to help get there (we have to do it anyway). these projects tend to last _years_. maybe a good thing to do is this: give us a measure of the reliability you need and we can optimize for it-- i.e. point us to a specific sample workload (say "replicate this specific 1TB archive") and we can race to it.
20:14 🔗 jbenet -- you can keep doing what you're doing and if we make the needed progress reasonably quickly you can re-eval then. ( also, joeyh: if we can get your help/advice along the way we can probably move even faster. )
20:14 🔗 jbenet (reasonably quickly measured in weeks)
20:15 🔗 jbenet also-- if i were you, i would build a layer of indirection between any backend system, so you can upgrade/not forcibly depend on anything over time. these efforts last _years_.
20:19 🔗 joeyh +1 layer of indirection
20:20 🔗 joeyh jbenet: so, looking at the ipfs data model and mapping it onto this, I imagine something like:
20:20 🔗 joeyh 1. IA adds each of their items to ipfs, gets the ipfs address for it
20:20 🔗 joeyh 2. users can then download the items into their own ipfs nodes
20:21 🔗 joeyh 3. then we need some way for users to communicate (or better, prove, but..) that they are backing up a given item
20:22 🔗 joeyh #3 could be done by the user setting up their own ipns namespace, and publishing a list of the items that have there, I suppose
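
Those three steps map onto stock go-ipfs CLI commands; driven from Python it might look like the sketch below. ipfs add, ipfs pin, and ipfs name publish are real subcommands, but the item path, exact flag behavior per version, and the pin-list file are assumptions. As joeyh notes, step 3 only announces possession, it doesn't prove it.

    import subprocess, tempfile

    def ipfs(*args) -> str:
        return subprocess.check_output(('ipfs',) + args, text=True).strip()

    # 1. IA adds an item, getting its address (-Q prints only the root hash)
    item_hash = ipfs('add', '-r', '-Q', '/path/to/item')   # placeholder path

    # 2. a user mirrors that item onto their own node
    ipfs('pin', 'add', item_hash)

    # 3. the user publishes the list of items they hold under their ipns name
    pins = ipfs('pin', 'ls', '--type=recursive')
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
        f.write(pins)
    ipfs('name', 'publish', ipfs('add', '-Q', f.name))
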
20:23 🔗 mike` (~mike@[redacted]) has left #internetarchive.bak
20:24 🔗 joeyh #1 is somewhat problematic to do without using vast amounts of disk space at the IA for ipfs
20:24 🔗 jbenet joeyh: yeah exactly. ipfs can also be used as a lib-- currently in Go, but can easily make a special binary with your own protocol that can do more sophisticated things if you need them.
20:24 🔗 jbenet my guess is any stock ipfs client can do what you need, but the power is there if you need it
20:25 🔗 joeyh what seems to remain is scalability, cf my experience last night with an OOM kill of ipfs when downloading a few hundred MB
20:25 🔗 jbenet on #1 one possibility is to make ipfs use an index on existing fs-- this is certainly not a good model for the average user, but for dedicated installations and PB of data, you dont want to throw it all into leveldb ;P
20:25 🔗 joeyh ah, that would be great for #1
20:25 🔗 jbenet joeyh: what is the size of the machine?
20:26 🔗 joeyh that machine has 2 gb of ram, probably well over 1 gb free
20:26 🔗 jbenet joeyh: i havent seen OOM for a long time-- it may be something about getting into a weird state
20:27 🔗 jbenet (i've been booting GB vms this week)
20:27 🔗 Start has quit (Disconnected.)
20:28 🔗 jbenet joeyh: can you repro reliably? would be awesome to get a stack trace (can kill a go proc with ctrl+\ anytime)
20:28 🔗 joeyh largest file in the IA is apparently 2 tb, for reference
20:28 🔗 jbenet (( anyway would love to sink deeper into this one test run-- but also dont want to pollute this channel -- either way :) ))
20:28 🔗 joeyh ok, if that's an unexpected bug, I'll try to repro it
20:29 🔗 joeyh but then it curves right down to 100 gb or so files
20:29 🔗 joeyh of course, there's also the question of scaling to a great many items
20:31 🔗 jbenet joeyh: awesome, is there a text file with all these sizes? how many 50-200GB+ files? -- i could generate random data of similar sizes and treat that as a test suite to go for
20:31 🔗 joeyh apparently the IA is working on getting us a full list of Items
20:31 🔗 jbenet joeyh: yep, i think the sanest thing is to shard right now (can use an index of ipfs objects themselves and tell different sets of nodes to `ipfs pin -r` different subsets)
20:31 🔗 jbenet sweet!
20:32 🔗 joeyh my impression was not many > 100 gb
20:32 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:32 🔗 joeyh and, 18 million items could easily be another order of magnitude more files
20:32 🔗 xmc yeah. for a long time IA policy was to strive to keep items under 10G each
20:32 🔗 xmc archiveteam blew right through that
20:33 🔗 xmc 10G because items can't be split across physical volumes, in current design
20:34 🔗 jbenet xmc: why not? cant split items into subblocks?
20:34 🔗 yipdw you'll find a couple hundred 50 GB items in the archiveteam collections
20:34 🔗 xmc jbenet: i think they treat an item as a directory in a real unix filesystem
20:34 🔗 joeyh jbenet: I wonder about dht scalability, etc to so many objects though
20:34 🔗 jbenet btw joeyh: we can make arbitrary file chunking datastructures, right now we use the simplest thing possible but if theres a file chunking / index datastructure that optimizes the IA use we can do that
20:36 🔗 jbenet joeyh: hmm-- dhts scale pretty well. if the use case foresees only accessing whole files (mdag roots), we can even run a separate dht that only advertises the roots.
20:36 🔗 jbenet there's other solutions being discussed, the good thing is that it's ~100loc to try something else.
20:36 🔗 xmc IA devs are traditionally fans of simple, popular things (tar, jpeg, txt)
20:37 🔗 xmc if that helps with guiding design
20:37 🔗 xmc but hey, working code works
20:39 🔗 jbenet xmc: yeah, makes sense -- ipfs splits large unix files into sub-blocks, (think of how a unix fs works underneath the hood) -- the indexing datastructure is pluggable, so you can use -- say, the ext4 indirect block layout, or something else depending on the use case. probably doesnt matter here at all-- just something that we have because it really changes
20:39 🔗 jbenet how well certain use cases (like video streaming of massive files) can perform
20:39 🔗 xmc *nod*
21:22 🔗 Start has quit (Disconnected.)
21:28 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
21:59 🔗 edward_ has quit (Ping timeout: 512 seconds)
22:19 🔗 Start has quit (Disconnected.)
23:06 🔗 mianaai` (~user@[redacted]) has joined #internetarchive.bak
23:08 🔗 mianaai has quit (Read error: Operation timed out)
23:09 🔗 dirt_ (james@[redacted]) has joined #internetarchive.bak
23:09 🔗 dirt has quit (Ping timeout: 258 seconds)
23:09 🔗 dirt_ is now known as dirt
23:09 🔗 DFJustin has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 db48x has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 ersi has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 jake1 has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 mianaai` has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 midas has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
23:14 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:15 🔗 DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
23:15 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
23:15 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
23:15 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
23:15 🔗 irc.Prison.NET gives channel operator status to db48x DFJustin
23:15 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
23:15 🔗 mianaai` (~user@[redacted]) has joined #internetarchive.bak
23:15 🔗 midas (~midas@[redacted]) has joined #internetarchive.bak
23:15 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
23:24 🔗 Start has quit (Disconnected.)
23:25 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:30 🔗 mianaai`` (~user@[redacted]) has joined #internetarchive.bak
23:31 🔗 Start has quit (Ping timeout: 370 seconds)
23:33 🔗 ersi has quit (Ping timeout: 258 seconds)
23:33 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
23:33 🔗 mianaai` has quit (Read error: Operation timed out)
23:36 🔗 cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
23:38 🔗 SketchCow hi
23:39 🔗 SketchCow I am digging my car iut.
23:39 🔗 SketchCow oit.
23:39 🔗 SketchCow snow
23:39 🔗 SketchCow tiny phone keyboard.
23:39 🔗 SketchCow anyway. I will be aroubd tonight.
23:42 🔗 DFJustin has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 db48x has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 jake1 has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 mianaai`` has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 midas has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
23:48 🔗 DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
23:48 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
23:48 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
23:48 🔗 irc.Prison.NET gives channel operator status to db48x DFJustin
23:48 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
23:48 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
23:49 🔗 db48x has quit (Ping timeout: 258 seconds)
23:50 🔗 midas1 (~midas@[redacted]) has joined #internetarchive.bak
