[00:09] *** bpye_ (~quassel@[redacted]) has joined #internetarchive.bak
[00:09] *** bpye_ is now known as bpye
[00:14] *** edward_ (~edward@[redacted]) has joined #internetarchive.bak
[00:25] *** slagfart (cb2352ae@[redacted]) has joined #internetarchive.bak
[00:25] *** slagfart has quit (Client Quit)
[00:26] *** Slagfart (webchat@[redacted]) has joined #internetarchive.bak
[00:27] <Slagfart> anyone solved it yet?
[00:33] *** edward_ has quit (Ping timeout: 512 seconds)
[00:33] <db48x> Slagfart: solved what, exactly?
[00:37] <fenn> goldbach's conjecture, duh...
[00:43] <xmc> a way to prevent people from cheating on actually storing data would be to have the checks be for a random byte-range of the file, rather than the whole thing
[00:44] <jbenet> xmc: look into proofs of retrievability
[00:44] <xmc> i haven't, maybe later
[00:44] <jbenet> my personal favorite: https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf
[00:45] <jbenet> but also, a much simpler merkle-tree (actual merkle tree) proof-of-storage will work just fine.
[00:45] *** knytt (~knytt@[redacted]) has joined #internetarchive.bak
[00:45] <knytt> I am drunk and I want to know everything that has happened with this so far
[00:45] <knytt> spare no details
[00:46] <knytt> for I am smart and can digest information
[00:46] <garyrh> knytt, look at http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK
[00:47] <knytt> garyrh, you are a gentleman and a scholar
[00:50] <closure> xmc: the problem with that method is thast if someone wants to be an asshat, they do this: 1. pretend to have a lot of files. 2. find the urls to download those files from the IA. 3. when asked for a proof, download just the byte range needed to generate the "proof"
[00:50] <closure> or more simply, just actually store files, but if IA ever asks for them back, damand $$$
[00:58] <garyrh> yeah, I think the only way to reduce the chance of that happening would be some kind of trust system
[00:58] <garyrh> like registered accounts are more trustworthy than anonymous, etc.
[00:58] <jbenet> you can do it with crypto, but it gets hard.
[00:59] <jbenet> thats what the proofs-of-retrievability are for-- they extract the file with each valid proof.
[00:59] <jbenet> it's slow, but it works in presence of an adversary that wont actually transmit the file back
[00:59] <jbenet> -- also, on outsourcing storage, http://cs.umd.edu/~amiller/nonoutsourceable.pdf can be adapted to prevent that
[01:00] <jbenet> it's a very cool idea: you force them to do proofs with their private key (thats used for some reward/memebership) as fast as they can, and based on sequentially determined indices into the file, so they are _forced_ to keep the entire file locally.
[01:01] <garyrh> ah, that's interesting.
[01:01] *** Peter_ (~Peter1231@[redacted]) has joined #internetarchive.bak
[01:01] <jbenet> i think https://www.cs.umd.edu/~elaine/docs/permacoin.pdf  has the construction
[01:01] <Peter_> hey
[01:02] <jbenet> (we're building something similar too-- http://filecoin.io/filecoin.pdf -- and we may borrow nonoutsourceability from permacoin down the road)
[01:02] <jbenet> garyrh, xmc, we're pretty serious about supporting you guys in this endeavor. it matters a lot to us
[01:03] <jbenet> closure pointed me to the proposal he wrote-- i'll try do the same
[01:04] <fenn> an adversary who is holding the file ransom wouldn't continue responding to large numbers of proofs of retrievability requests
[01:05] *** Peter_ has quit (Peter_)
[01:06] <jbenet> fenn: yeah that's accounted for in these protocols. that counts as malicious behavior that is not rewarded (with trust, monetary rewards, whatever)
[01:06] <jbenet> you can ride a line of "being too computationally busy to respond" and "limiting responses"
[01:06] <closure> jbenet: I suggest your proposal be shorter -- I suspect ipfs will also allow it to be more elegant ;)
[01:06] <jbenet> but the point is you can get the file out eventually, which is useful.
[01:07] <closure> problem is, an actor can behave entirely trustworthily, until the IA burns down, and then switch to ransom mode.
[01:08] <jbenet> (also for retrieving the file through the proofs, the SW scheme is worse, because it's compact. it's better to have one of the more bandwidth intensive ones)
[01:09] <fenn> i'm not sure this is really a problem. the ransomer would have to own all backup copies of the file
[01:09] <jbenet> closure: yeah which is why often these protocols from academia are cast as "cloud storage from company X" who has trust to lose. and the blockchain protocols cast in terms of proofs they must do to win money
[01:09] <closure> if you have an ongoing incentive system, I can see those proofs working, but in our case, there's not currently much incentive aside from nice/easy, and at the point the data needs to be retrived, the situation has changed
[01:09] <jbenet> closure: yep. agreed
[01:10] <closure> all I can see to do about this is try to avoid situations where only bad actors have a given file
[01:11] <jbenet> or have an incentive system that is independent of the daat
[01:11] <jbenet> data*
[01:11] <jbenet> (eg. the cryptocurrency approaches)
[01:12] *** thunk (4746deec@[redacted]) has joined #internetarchive.bak
[01:12] <jbenet> (pre-encrypting the data helps. though in the archive's case.... who else would store that much data? {archive, google, gov agencies}
[01:13] <Slagfart> why differentiate between the retrieval process itself, and proof of ownership?
[01:13] <Slagfart> wouldn't seeding valid data to other users effectively be the same as non-ransomability?
[01:14] <closure> no, because it's easy to automate detecting "fire at IA, all data lost" and switching strategies to defection
[01:14] <Slagfart> say you've got a torrent swarm going, and IA burns down. IA should simply reconnect as a peer, and do it quietly.
[01:15] <Slagfart> really? is it? would require massive collusion
[01:15] <closure> I suppose it works for lesser disasters, like the wrong set of drives all dying
[01:16] *** zx (sid17829@[redacted]) has joined #internetarchive.bak
[01:17] <Slagfart> if anyone can join and contribute, you can inherently make an assumption that a % of users are honest. I don't think a ransomer is going to bother, because their payoff is so uncertain
[01:18] <jbenet> so, unless you're doing constant non-outsourceability proofs, all the nodes could pool their storage and store 1 replica. all non-honest nodes (i.e. epsilon rational nodes) would.
[01:18] <Slagfart> let's assume 50% of users are ransomers, and 50% of the remainder are genuine but are unreliable. you've still only got the need for 4 copies out there to cover for that
[01:19] <jbenet> so the total storage % that the honest nodes control matters a lot without NOSP (non-outsoucr....)
[01:19] <Slagfart> doesn't it only matter in relation to the total pool?
[01:20] <jbenet> right, it's the % of the total addressable storage
[01:20] <Slagfart> if it's 10% honest nodes, but the pool capacity is 2000% of what's required, you've still got two points of failure
[01:20] <jbenet> cause the honest nodes better distribute multiple replicas between them
[01:21] <Slagfart> if the pool capacity doubles (hard drives and bandwidth double in size overnight), you only need 5% honest nodes
[01:21] <Slagfart> that honest node problem has already been solved using bittorrent, via hash trees
[01:21] <jbenet> yeah, also in the pool world, sub-nodes would defect, so the ransom total is not infinity
[01:22] <jbenet> (or i think-- i havent mathed)
[01:22] <Slagfart> my understanding is, you can't fake large datasets against an arbritrary hash if the hash is modern enough (eg SHA2) without computing resources that would drastically exceed the potential payout
[01:24] <jbenet> that depends entirely on "the potential payout"-- how much are people willing to pay for the wealth of humanity's knowledge?
[01:24] <Slagfart> i reckon just torrent it imho guys hey
[01:24] <Slagfart> jbenet - I think it's very low! it's already available for free
[01:24] <jbenet> not it it's the only copy left
[01:24] <jbenet> the value increases dramatically
[01:25] <Slagfart> also let's not kid outselves - this isn't the wealth of humanity's knowledge, this is largely old Geocities sites
[01:25] <jbenet> anyway i tend to agree- simple seeding will likely work in practice (hence why ipfs doesnt include any proofs-of-retrievability)
[01:25] <jbenet> hahahaha
[01:25] <garyrh> heh not really.
[01:25] <Slagfart> I agree jbenet, but if you create this, and you see 10 seeds on the torrent swarm, I think you could rest easy
[01:26] <Slagfart> :)
[01:26] <jbenet> ok, i need to change locations  -- archiveteam, will post a proposal for you soon, would be great to work together on this, you too closure :)  <3
[01:29] <Slagfart> has anyone worked out how to zip up 20 petabytes?
[01:30] <garyrh> Very slowly.
[01:31] <Slagfart> you've got a method to keep track of what's in each zip?
[01:31] <closure> Slagfart: I think you'll find you get a torrent file of some truely amazing size (like 1 terabyte) and/or your bittorrent client runs out of ram and/or each chunk is so many giabytes in size that it turns out to be nearly impossible to complete any chunk
[01:31] *** godane has quit (Quit: Leaving.)
[01:32] <Slagfart> oh I agree - but the main page halready has the proposal to split  it up into 42k 500GB chunks
[01:32] <Slagfart> each one would be a different torrent. the piratebay for example hosts much more than this, and any given torrent can be reliably downloaded :)
[01:33] <Slagfart> there's literally a financial disincentive to seed on the piratebay, but people keep doing it. I think the altruism will be a big factor - why do you guys get donations every month?
[01:38] *** Lord_Nigh (~Lord_Nigh@[redacted]) has joined #internetarchive.bak
[01:38] <Lord_Nigh> hi all
[01:39] <closure> Slagfart: I think that could work, and it has a virtue of simplicity (aside from the whole bad actor issue)
[01:40] *** knytt has quit (Quit: Leaving)
[01:41] <Slagfart> I'm changing locations too - http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork
[01:41] <Slagfart> :)
[01:41] <Slagfart> will be back later - interesting discussion! cheers all
[01:42] <closure> Slagfart: here, I've written it up http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/torrents_implementation
[01:44] <Lord_Nigh> is prometheus from ipfs in here?
[01:44] <Lord_Nigh> https://news.ycombinator.com/item?id=9148576
[01:45] <Lord_Nigh> as for integrity validation of 500gb blocks (assuming that's how the system ends up getting done) wouldn't it make the most sense to pack a block into 475gb, pack on 25gb of parity/ecc data, and sign the whole damn thing with an IA master key?
[01:45] <garyrh> I think that's jbenet.
[01:46] *** Slagfart has quit (Ping timeout: 240 seconds)
[01:51] <jbenet> Lord_Nigh yeah that's me
[01:54] <closure> oh, top of HN right now, I see
[01:54] <closure> I've done some proposal gardening on the wiki page
[01:58] <closure> hmm, the geocities torrent was 1 tb, and it strained some torrent stuff and had trouble getting well seeded.
[02:00] *** mike` (~mike@[redacted]) has joined #internetarchive.bak
[02:05] <closure> the geocities torrent currently has 1 leech and 0 seeds :(
[02:05] <closure> and that's the 641 gb fixed version
[02:09] <fenn> what specifically went wrong with the 1TB version?
[02:21] <fenn> d'eaux https://thepiratebay.se/torrent/6350414/Geocities_-_The_PATCHED_Torrent 404's
[02:21] <closure> https://thepiratebay.se/torrent/6353395/Geocities_-_The_PATCHED_Torrent
[02:37] <S[h]O[r]T> piece size was too big
[02:50] *** Control-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
[02:56] *** Ctrl-S has quit (Read error: Connection reset by peer)
[02:56] *** Control-S is now known as Ctrl-S
[02:56] *** godane (~slacker@[redacted]) has joined #internetarchive.bak
[02:57] *** svchfoo2 gives channel operator status to godane
[04:37] <closure> well, the torrents idea is looking a little less likely.. how do all those TB of torrents ever get seeded to start?
[04:40] <ivan`> uTorrent and others support webseeds which grabs over HTTP
[04:41] <closure> yes, but then you need a file, up for http
[04:41] <db48x> which IA already done
[04:41] <db48x> does
[04:41] <closure> maybe the IA could seed some fraction of all the files at a time, not all of them. Woud need additional PB of storage
[04:42] <closure> not in 500 gb collections of items, it doesn't
[04:42] <closure> see page
[04:43] <db48x> oh, yea
[04:44] <db48x> where you split the archive up into uniform-sized chunks, each with a torrent
[04:44] <closure> right
[04:44] <db48x> I don't see why that'd be necessary though; IA already has a torrent for every item
[04:44] <closure> which are not exactly all getting lots of seeds
[04:45] <db48x> https://ia700800.us.archive.org/17/items/ZztByEpicMegagames/ZztByEpicMegagames_archive.torrent
[04:45] <db48x> yes, thus my suggestion of a custom BitTorrent client
[04:45] <closure> it's hard to get 28 million torrents seeded I think
[04:45] <xmc> closure: hmmm, true.
[04:45] <db48x> they don't have seeds because there are too many to join any fraction of them
[04:46] <db48x> most users would have to manually click on a thousand torrent links
[04:46] <closure> well, if there are 10 thousand users, and you want 10 copies of every file, that bittorrent client would need to load up 28 thousand torrents. Is that doable? I know some people run a lot of torrents, but..
[04:47] <db48x> if instead they could download a client and tell it that they'd like it to use 100GB of space, and that they like jazz, then it could go join a bunch of torrent swarms automatically
[04:47] <closure> s/bunch/metric fuckton/
[04:47] <db48x> that's a point, yes
[04:49] <db48x> I hadn't considered the constant overhead of the swarm talking to itself; it's a good point
[04:49] <closure> assume each torrent takes I dunno, 100 kb of ram. That's 3 gb of ram used by the torrent client for 28k torrents
[04:49] <db48x> it probably dies out once the swarm stabilizes and has no peers, then picks up again when there is a new peer
[04:50] <db48x> 3gb of address space; it could be swapped out
[04:50] <db48x> it won't matter if it takes the client a few ms to swap it back and to answer a query about what pieces it has
[04:50] <closure> you have to track chunks, peers, etc, etc, I don't have numbers, but 100 kb ram seems ballpark
[04:51] <db48x> agreed
[04:52] <closure> also the whole tracker side.. anyone remember how many items are in TPB?
[04:52] <db48x> probably varies on chunk size and the number of chunks, which is quite variable for IA items
[04:53] <db48x> https://news.ycombinator.com/item?id=9149262
[04:53] <db48x> https://torrentfreak.com/download-a-copy-of-the-pirate-bay-its-only-90-mb-120209/ says 1.6 million in 2012
[04:54] <closure> that comment is wrong, you still need trackers even if using magnet links
[04:55] <db48x> yea
[04:55] <db48x> just a quick source of numbers
[04:56] <closure> yeah, sounds like there might be tracker software that can handle XX million torrents
[04:56] <closure> pretty amazing
[04:57] <db48x> yea, impressive
[04:57] <db48x> and memory gets cheaper all the time :)
[04:57] <db48x> btw, use ~~~~ on the wiki to insert a signature
[04:57] <closure> I suppose trackers could shard pretty well amoung machines
[04:58] *** WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
[05:07] *** cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
[05:09] *** cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
[05:09] *** cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
[05:42] *** X-Scale has quit (Ping timeout: 240 seconds)
[06:00] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[06:05] *** dyln (~user@[redacted]) has joined #internetarchive.bak
[06:06] *** bzc6p has quit (Ping timeout: 601 seconds)
[06:25] <SketchCow> Damn, things got crazy
[06:26] <SketchCow> So.
[06:26] <SketchCow> 1. I really want to go forward with Closure's solution. It's by far my favorite.
[06:27] <SketchCow> 2. I am not against the other two proposals. If they want to initiate tests and the same methods of incrementally doing tests, go for it. Having families and methods is nice.
[06:27] <SketchCow> 3. That said, I think toerrents is insane
[06:27] <db48x> the whole idea is insane :P
[06:31] <SketchCow> Yes
[06:31] <SketchCow> I also think ipfs is interesting, but different than this.
[06:32] <SketchCow> (Niw that I read it)
[06:32] <db48x> but torrents is probably the insaner of the two
[06:32] <SketchCow> So, I will push more for closue.
[06:32] <SketchCow> Of the three, there's three
[06:32] <db48x> of the N
[06:32] <SketchCow> So, I sent our internal guy on the Censs.
[06:33] <SketchCow> Census.
[06:33] <SketchCow> 14,926,080 items in the database.
[06:33] <SketchCow> (Not darked, not in some way private, not in some way weird infrastructure in the database)
[06:33] <SketchCow> Next, he's building a full mined list of these items.
[06:33] <SketchCow> I mentioned this to Brewster
[06:34] <SketchCow> Brewster suggests Prelinger be one of the collections.
[06:35] <DFJustin> I think having 'photogenic' collections at least at first will probably help volunteer uptake / virality
[06:35] <SketchCow> Yes
[07:13] *** ersi (~ersi@[redacted]) has joined #internetarchive.bak
[07:18] *** dyln has quit (Read error: Operation timed out)
[07:50] *** pg (webchat@[redacted]) has joined #internetarchive.bak
[07:51] *** pg has quit (Client Quit)
[08:16] *** edward_ (~edward@[redacted]) has joined #internetarchive.bak
[08:31] <edward_> https://news.ycombinator.com/item?id=9147719
[08:37] <xmc> you
[08:37] <xmc> so it is you we have to blame for being paid attention to
[08:44] <Sanqui> that's kind of early attention
[09:02] *** inversech (~smuxi@[redacted]) has joined #internetarchive.bak
[09:07] *** midas (~midas@[redacted]) has joined #internetarchive.bak
[09:13] <yipdw> speaking of being photogenic, since I'm already poking around IA's statusboard, might as well also look into how tied it is to showing book scans
[09:14] <yipdw> if there's a way to adapt its presentation style to show any item preview that might be fun
[09:19] *** hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
[09:20] <hatseflat> hi everyone
[09:47] *** aschmitz has quit (Ping timeout: 265 seconds)
[09:56] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[09:58] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[10:58] <Sanqui> yipdw: it would be really cool if you could show thumbnails of the images, snippets of the text, samples of the audio, etc. inside the archives
[11:09] *** Cameron_D (~cameron@[redacted]) has joined #internetarchive.bak
[11:25] *** zx (sid17829@[redacted]) has left #internetarchive.bak
[13:01] *** godane has quit (Quit: Leaving.)
[13:13] *** trs80 (trs80@[redacted]) has joined #internetarchive.bak
[13:14] *** svchfoo1 gives channel operator status to trs80
[13:14] *** svchfoo2 gives channel operator status to trs80
[13:15] *** nicoo (~nico@[redacted]) has joined #internetarchive.bak
[13:26] *** godane (~slacker@[redacted]) has joined #internetarchive.bak
[13:27] *** svchfoo2 gives channel operator status to godane
[13:36] *** ams (webchat@[redacted]) has joined #internetarchive.bak
[13:43] *** phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
[13:52] <cf_> Hi all. Just want to toss out my 2 cents: 1) I donâ€™t think the pure torrents solution will work - there are just too many intermediary steps to repack the data and too much to keep track of afterwards. Itâ€™s possible and the easiest solution upfront, but in the long run I donâ€™t think itâ€™ll work out too well. 2) IPFS is really neat, but just isnâ€™t ready yet. Iâ€™ve been running some tests on a VPS with one of our twitch megawarcs and itâ€
[13:52] <cf_> just not ready for the level of use that we would be forcing on it. I also haven’t seen a way to easily determine exactly which other nodes should receive a given file - i.e. it seems like you just push the file out and the network decides who gets copies of it. I say this only because if we’re going to be pushing an extraordinary amount of pretty important data on to the network (and I think 21PB qualifies for this), I feel like we need to b
[13:52] <cf_> able to control it so that it only goes onto machines owned by people who are willing to keep up a certain level of uptime, throughput, etc. 3) that leaves us with git-annex, which I think is the best solution. The dev is part of the team, the design proposal makes it look dead simple to use and implement a solution with, and the only issue that I see is the file limit, but (as discussed) that’s easily fixed with shards. Anyways, that’s just
[13:52] <cf_> POV on this.
[13:54] *** ams has quit (Ping timeout: 240 seconds)
[14:00] *** VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
[14:03] *** raylee (~raylee@[redacted]) has joined #internetarchive.bak
[14:18] <closure> SketchCow: it's funny, torrents is my favorite solution, if seeding can be solved :)
[14:19] <closure> one seeding solution is to just start with 500 gb torrent 1, get it seeded to enough people we trust it will live, and delete from our server which moves on to 500 gb torrent 2
[14:20] <closure> and then if torrent 1 becomes unhealthy, we either a) recreate it from IA Items and get it re-seeded, or b) find we cannot do that anymore (eg, some Items went dark or were modified) and stop offering torrent 1, instead offering torrennt 1B which contains all the items that were in torrrent 1
[14:20] <closure> seems pretty doable
[14:21] <ersi> cf_: Pro-tip: This is IRC, if you paste long text - it's going to get cut/wrapped
[14:21] <ersi> s/cut\/wrapped/truncated/
[14:21] <Sanqui> not a fan of the torrent solution
[14:21] <ersi> I'm a huge fan
[14:22] *** ersi blows some hot air around
[14:22] <closure> it's easy, and I don't have to do any off the work >> fan
[14:22] <Sanqui> it would be cool if distribution was..  distributed a la torrents
[14:22] <Sanqui> but torrents are unfriendly to splitting and cold storage
[14:22] <closure> this is true
[14:23] <closure> git-annex wins there
[14:23] <Sanqui> would it be possible for git-annex to be peer to peer?
[14:24] <closure> if the peers have some way of being introduced and communicating, yes
[14:25] <Sanqui> that's the only advantage I saw in torrents (besides simplicity of implementation)
[14:26] <arkiver> kut
[14:26] <arkiver> oops
[14:33] *** closure is adding ipfs support to git-annex this morning, BTW
[14:33] <closure> :)
[14:54] <phuzion> What about Tahoe-LAFS?
[14:55] <phuzion> I dunno how well it would handle 21PB, but if it handles it well, I think it could certainly be a contender for a storage solution.
[15:14] *** csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
[15:26] *** Start has quit (Disconnected.)
[16:02] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[16:09] *** Start has quit (Read error: Connection reset by peer)
[16:09] *** Start_ (~Start@[redacted]) has joined #internetarchive.bak
[16:20] *** enkiv2 (~john@[redacted]) has joined #internetarchive.bak
[16:24] *** nicoo has quit (Ping timeout: 260 seconds)
[16:52] *** Start_ has quit (Disconnected.)
[17:01] *** nicoo (~nico@[redacted]) has joined #internetarchive.bak
[17:10] *** wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
[17:32] <midas> lol arkiver :p
[17:32] <arkiver> :/
[17:32] <midas> the kut moment
[17:32] <arkiver> was working on something, didn't work, typed in wrong chat :P
[17:33] <midas> :p
[17:39] *** everdred (~irssi@[redacted]) has left #internetarchive.bak
[17:49] *** WubTheCap has quit (Quit: Restart)
[17:50] *** WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
[17:53] <ersi> kut
[18:01] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:13] *** yhager (~yuval@[redacted]) has joined #internetarchive.bak
[18:20] *** destrudo (~destrudo@[redacted]) has joined #internetarchive.bak
[18:20] *** shabble (~shabble@[redacted]) has joined #internetarchive.bak
[18:21] *** mianaai (~user@[redacted]) has joined #internetarchive.bak
[18:21] <mianaai> hi
[18:25] <ersi> Hi.
[18:27] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[18:31] *** bzc6p_ has quit (Read error: Operation timed out)
[18:42] *** Start has quit (Disconnected.)
[18:57] *** closure is now known as joeyh
[18:57] *** bzc6p__ is now known as bzc6p
[19:22] *** jake1 (~Adium@[redacted]) has joined #internetarchive.bak
[19:53] *** VADemon has quit (Read error: Connection reset by peer)
[20:10] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:12] *** X-Scale has quit (Ping timeout: 240 seconds)
[20:14] <jbenet> SketchCow, cf_, joeyh: yeah ipfs is not perfectly ready yet, but please note we're moving really fast. 6mo ago nothing worked. i think we can get the reliability you need reasonably quickly-- and we are eager to help get there (we have to do it anyway). these projects tend to last _years_.  maybe a good thing to do is this: give us a measure of the
[20:14] <jbenet> even faster. )
[20:14] <jbenet> reliability you need and we can optimize for it-- i.e. point us to a specific sample workload (say "replicate this specific 1TB archive") and we can race to it. -- you can keep doing what you're doing and if we make the needed progress reasonably quickly you can re-eval then. (  also, joeyh: if we can get your help/advice along the way we can probably move
[20:14] <jbenet> (reasonably quickly measured in weeks)
[20:15] <jbenet> also-- if i were you, i would built a layer of indirection between any backend system, so you can upgrade over time. these efforts last _years_.
[20:15] <jbenet> can upgrade/not forcibly-depend on anything
[20:19] <joeyh> +1 layer of indirection
[20:20] <joeyh> jbenet: so, looking at the ipfs data model and mapping it onto this, I imagine something like:
[20:20] <joeyh> 1. IA adds each of their items to ipfs, gets the ipfs address for it
[20:20] <joeyh> 2. users can then download the items into their own ipfs nodes
[20:21] <joeyh> 3. then we need some way for users to communicate (or better, prove, but..) that they are backing up a given item
[20:22] <joeyh> #3 could be done by the user setting up their own ipns namespace, and publishing a list of the items that have there, I suppose
[20:23] *** mike` (~mike@[redacted]) has left #internetarchive.bak
[20:24] <joeyh> #1 is somewhat problimatic to do without using vast amounts of disk space used at the IA for ipfs
[20:24] <jbenet> joeyh: yeah exactly. ipfs can also be used as a lib-- currently in Go, but can easily make a special binary with your own protocol that can do more sophisticated things if you need them.
[20:24] <jbenet> my guess is any stock ipfs client can do what you need, but the power is there if you need it
[20:25] <joeyh> what seems to remain is scalability, cf my experience last night with an OOM kill of ipfs when downloading a few hundred MB
[20:25] <jbenet> on #1 one possibility is to make ipfs use an index on existing fs-- this is certainly not a good model for the average user, but for dedicated installations and PB of data, you dont want to throw it all into leveldb ;P
[20:25] <joeyh> ah, that would be great for #1
[20:25] <jbenet> joeyh: what is the size of the machine?
[20:26] <joeyh> that machine has 2 gb of ram, probably well over 1 gb free
[20:26] <jbenet> joeyh: i havent seen OOM for a long time-- it may be something about getting into a weird state
[20:27] <jbenet> (i've been booting GB vms this week)
[20:27] *** Start has quit (Disconnected.)
[20:28] <jbenet> joeyh: can you repro reliably? would be awesome to get a stack trace (can kill a go proc witl crtl+\ anytime)
[20:28] <joeyh> largest file in the IA is apparently 2 tb, for reference
[20:28] <jbenet> (( anyway would love to sink deeper into this one test run-- but also dont want to pollute this channel -- either way :) ))
[20:28] <joeyh> ok, if that's an unexpected bug, I'll try to repro it
[20:29] <joeyh> but then it curves right down to 100 gb or so files
[20:29] <joeyh> of course, there's also the question of scaling to a great many items
[20:31] <jbenet> joeyh: awesome, is there a text file with all these sizes? how many 50-200GB+ files? -- i could generate random data of similar sizes and treat that as a test suite to go for)
[20:31] <joeyh> apparenty the IA is working on getting us a full list of Items
[20:31] <jbenet> joeyh: yep, i think the sanest thing is to shard right now (can use an index of ipfs objects themselves and tell different sets of nodes to `ipfs pin -r` different subsets
[20:31] <jbenet> sweet!
[20:32] <joeyh> my impression was not many > 100 gb
[20:32] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:32] <joeyh> and, 18 million items, could easily be another order of magnitude files
[20:32] <xmc> yeah. for a long time IA policy was to strive to keep items under 10G each
[20:32] <xmc> archiveteam blew right through that
[20:33] <xmc> 10G because items can't be split across physical volumes, in current design
[20:34] <jbenet> xmc: why not? cant split items into subblocks?
[20:34] <yipdw> you'll find a couple hundred 50 GB items in the archiveteam collections
[20:34] <xmc> jbenet: i think they treat an item as a directory in a real unix filesystem
[20:34] <joeyh> jbenet: I wonder about dht scalability, etc to so many objects though
[20:34] <jbenet> btw joeyh: we can make arbitrary file chunking datastructures, right now we use the simplest thing possible but if theres a file chunking / index datastructure that optimizes the IA use we can do that
[20:36] <jbenet> joeyh: hmm-- dhts scale pretty well. if the use case foresees only accessing whole files (mdag roots), we can even run a separate dht that only advertises the roots.
[20:36] <jbenet> there's other solutions being discussed, the good thing is that it's ~100loc to try something else.
[20:36] <xmc> IA devs are traditionally fans of simple, popular things (tar, jpeg, txt)
[20:37] <xmc> if that helps with guiding design
[20:37] <xmc> but hey, working code works
[20:39] <jbenet> how well certain use cases (like video streaming of massive files) can perform
[20:39] <jbenet> xmc: yeah, makes sense -- ipfs splits large unix files into sub-blocks, (think of how a unix fs works underneath the hood) -- the indexing datastructure is pluggable, so you can use -- say, the ext4 indirect block layout, or something else depending on the use case. probably doesnt matter here at all-- just something that we have because it really changes
[20:39] <xmc> *nod*
[21:22] *** Start has quit (Disconnected.)
[21:28] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[21:59] *** edward_ has quit (Ping timeout: 512 seconds)
[22:19] *** Start has quit (Disconnected.)
[23:06] *** mianaai` (~user@[redacted]) has joined #internetarchive.bak
[23:08] *** mianaai has quit (Read error: Operation timed out)
[23:09] *** dirt_ (james@[redacted]) has joined #internetarchive.bak
[23:09] *** dirt has quit (Ping timeout: 258 seconds)
[23:09] *** dirt_ is now known as dirt
[23:09] *** DFJustin has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** GauntletW has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** db48x has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** ersi has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** jake1 has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** mianaai` has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** midas has quit (hub.efnet.us irc.Prison.NET)
[23:09] *** yhager has quit (hub.efnet.us irc.Prison.NET)
[23:14] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[23:15] *** DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
[23:15] *** GauntletW (~ted@[redacted]) has joined #internetarchive.bak
[23:15] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[23:15] *** ersi (~ersi@[redacted]) has joined #internetarchive.bak
[23:15] *** irc.Prison.NET gives channel operator status to db48x DFJustin
[23:15] *** jake1 (~Adium@[redacted]) has joined #internetarchive.bak
[23:15] *** mianaai` (~user@[redacted]) has joined #internetarchive.bak
[23:15] *** midas (~midas@[redacted]) has joined #internetarchive.bak
[23:15] *** yhager (~yuval@[redacted]) has joined #internetarchive.bak
[23:24] *** Start has quit (Disconnected.)
[23:25] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[23:30] *** mianaai`` (~user@[redacted]) has joined #internetarchive.bak
[23:31] *** Start has quit (Ping timeout: 370 seconds)
[23:33] *** ersi has quit (Ping timeout: 258 seconds)
[23:33] *** ersi (~ersi@[redacted]) has joined #internetarchive.bak
[23:33] *** mianaai` has quit (Read error: Operation timed out)
[23:36] *** cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
[23:38] <SketchCow> hi
[23:39] <SketchCow> I am digging my car iut.
[23:39] <SketchCow> oit.
[23:39] <SketchCow> snow
[23:39] <SketchCow> tiny phone keyboard.
[23:39] <SketchCow> anyway. I will be aroubd tonight.
[23:42] *** DFJustin has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** GauntletW has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** db48x has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** jake1 has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** mianaai`` has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** midas has quit (hub.efnet.us irc.Prison.NET)
[23:42] *** yhager has quit (hub.efnet.us irc.Prison.NET)
[23:48] *** DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
[23:48] *** GauntletW (~ted@[redacted]) has joined #internetarchive.bak
[23:48] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[23:48] *** irc.Prison.NET gives channel operator status to db48x DFJustin
[23:48] *** jake1 (~Adium@[redacted]) has joined #internetarchive.bak
[23:48] *** yhager (~yuval@[redacted]) has joined #internetarchive.bak
[23:49] *** db48x has quit (Ping timeout: 258 seconds)
[23:50] *** midas1 (~midas@[redacted]) has joined #internetarchive.bak