#internetarchive.bak 2015-03-05,Thu

Time Nickname Message
00:09 🔗 bpye_ (~quassel@[redacted]) has joined #internetarchive.bak
00:09 🔗 bpye_ is now known as bpye
00:14 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
00:25 🔗 slagfart (cb2352ae@[redacted]) has joined #internetarchive.bak
00:25 🔗 slagfart has quit (Client Quit)
00:26 🔗 Slagfart (webchat@[redacted]) has joined #internetarchive.bak
00:27 🔗 Slagfart anyone solved it yet?
00:33 🔗 edward_ has quit (Ping timeout: 512 seconds)
00:33 🔗 db48x Slagfart: solved what, exactly?
00:37 🔗 fenn goldbach's conjecture, duh...
00:43 🔗 xmc a way to prevent people from cheating on actually storing data would be to have the checks be for a random byte-range of the file, rather than the whole thing
00:44 🔗 jbenet xmc: look into proofs of retrievability
00:44 🔗 xmc i haven't, maybe later
00:44 🔗 jbenet my personal favorite: https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf
00:45 🔗 jbenet but also, a much simpler merkle-tree (actual merkle tree) proof-of-storage will work just fine.
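
A minimal sketch of the Merkle-tree proof-of-storage jbenet mentions, assuming SHA-256 and fixed 1 MiB chunks; the chunk size, function names, and challenge flow are illustrative, not any specific protocol. The verifier keeps only the root, picks a random chunk index, and the prover answers with that chunk plus its sibling path:

    # Sketch: Merkle-tree proof-of-storage (illustrative, not a real protocol)
    import hashlib

    CHUNK = 1 << 20  # 1 MiB

    def sha256(b: bytes) -> bytes:
        return hashlib.sha256(b).digest()

    def chunk_hashes(path: str) -> list[bytes]:
        out = []
        with open(path, 'rb') as f:
            while (c := f.read(CHUNK)):
                out.append(sha256(c))
        return out

    def merkle_root(leaves: list[bytes]) -> bytes:
        level = leaves[:]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])        # duplicate odd node
            level = [sha256(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    def merkle_path(leaves: list[bytes], i: int) -> list[bytes]:
        path, level = [], leaves[:]            # sibling hashes, leaf to root
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            path.append(level[i ^ 1])
            level = [sha256(level[j] + level[j + 1])
                     for j in range(0, len(level), 2)]
            i //= 2
        return path

    def verify(root: bytes, leaf: bytes, i: int, path: list[bytes]) -> bool:
        h = leaf
        for sib in path:
            h = sha256(h + sib) if i % 2 == 0 else sha256(sib + h)
            i //= 2
        return h == root

Each challenge is O(log n) to check against the stored root; as closure points out just below, the catch is that a cheater can fetch exactly the challenged chunk from the IA on demand.
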
00:45 🔗 knytt (~knytt@[redacted]) has joined #internetarchive.bak
00:45 🔗 knytt I am drunk and I want to know everything that has happened with this so far
00:45 🔗 knytt spare no details
00:46 🔗 knytt for I am smart and can digest information
00:46 🔗 garyrh knytt, look at http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK
00:47 🔗 knytt garyrh, you are a gentleman and a scholar
00:50 🔗 closure xmc: the problem with that method is that if someone wants to be an asshat, they do this: 1. pretend to have a lot of files. 2. find the urls to download those files from the IA. 3. when asked for a proof, download just the byte range needed to generate the "proof"
00:50 🔗 closure or more simply, just actually store files, but if IA ever asks for them back, demand $$$
00:58 🔗 garyrh yeah, I think the only way to reduce the chance of that happening would be some kind of trust system
00:58 🔗 garyrh like registered accounts are more trustworthy than anonymous, etc.
00:58 🔗 jbenet you can do it with crypto, but it gets hard.
00:59 🔗 jbenet thats what the proofs-of-retrievability are for-- they extract the file with each valid proof.
00:59 🔗 jbenet it's slow, but it works in the presence of an adversary that won't actually transmit the file back
00:59 🔗 jbenet -- also, on outsourcing storage, http://cs.umd.edu/~amiller/nonoutsourceable.pdf can be adapted to prevent that
01:00 🔗 jbenet it's a very cool idea: you force them to do proofs with their private key (thats used for some reward/membership) as fast as they can, and based on sequentially determined indices into the file, so they are _forced_ to keep the entire file locally.
01:01 🔗 garyrh ah, that's interesting.
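
A rough sketch of that sequential-index idea: each challenge index depends on the prover's private key and on the chunk just read, so the sequence can't be precomputed or outsourced without shipping the key. Permacoin uses real signatures and tunes the round count against network latency; the keyed hash below is only a stand-in.

    # Illustrative sequential proof; hashing with `privkey` stands in for
    # the signature scheme used in the Permacoin construction linked above.
    import hashlib

    def prove(privkey: bytes, chunks: list[bytes], rounds: int, seed: bytes):
        idx = int.from_bytes(hashlib.sha256(privkey + seed).digest(), 'big') % len(chunks)
        transcript = []
        for _ in range(rounds):
            c = chunks[idx]                 # must be local: it decides the next index
            tag = hashlib.sha256(privkey + c).digest()
            transcript.append((idx, tag))
            idx = int.from_bytes(tag, 'big') % len(chunks)
        return transcript

Because fetching a remote chunk per round would blow the time budget, a rational prover keeps the whole file on disk.
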
01:01 🔗 Peter_ (~Peter1231@[redacted]) has joined #internetarchive.bak
01:01 🔗 jbenet i think https://www.cs.umd.edu/~elaine/docs/permacoin.pdf has the construction
01:01 🔗 Peter_ hey
01:02 🔗 jbenet (we're building something similar too-- http://filecoin.io/filecoin.pdf -- and we may borrow nonoutsourceability from permacoin down the road)
01:02 🔗 jbenet garyrh, xmc, we're pretty serious about supporting you guys in this endeavor. it matters a lot to us
01:03 🔗 jbenet closure pointed me to the proposal he wrote-- i'll try do the same
01:04 🔗 fenn an adversary who is holding the file ransom wouldn't continue responding to large numbers of proofs of retrievability requests
01:05 🔗 Peter_ has quit (Peter_)
01:06 🔗 jbenet fenn: yeah that's accounted for in these protocols. that counts as malicious behavior that is not rewarded (with trust, monetary rewards, whatever)
01:06 🔗 jbenet you can ride a line of "being too computationally busy to respond" and "limiting responses"
01:06 🔗 closure jbenet: I suggest your proposal be shorter -- I suspect ipfs will also allow it to be more elegant ;)
01:06 🔗 jbenet but the point is you can get the file out eventually, which is useful.
01:07 🔗 closure problem is, an actor can behave entirely trustworthily, until the IA burns down, and then switch to ransom mode.
01:08 🔗 jbenet (also for retrieving the file through the proofs, the SW scheme is worse, because it's compact. it's better to have one of the more bandwidth intensive ones)
01:09 🔗 fenn i'm not sure this is really a problem. the ransomer would have to own all backup copies of the file
01:09 🔗 jbenet closure: yeah which is why often these protocols from academia are cast as "cloud storage from company X" who has trust to lose. and the blockchain protocols cast in terms of proofs they must do to win money
01:09 🔗 closure if you have an ongoing incentive system, I can see those proofs working, but in our case, there's not currently much incentive aside from nice/easy, and at the point the data needs to be retrieved, the situation has changed
01:09 🔗 jbenet closure: yep. agreed
01:10 🔗 closure all I can see to do about this is try to avoid situations where only bad actors have a given file
01:11 🔗 jbenet or have an incentive system that is independent of the data
01:11 🔗 jbenet (eg. the cryptocurrency approaches)
01:12 🔗 thunk (4746deec@[redacted]) has joined #internetarchive.bak
01:12 🔗 jbenet (pre-encrypting the data helps. though in the archive's case.... who else would store that much data? {archive, google, gov agencies})
01:13 🔗 Slagfart why differentiate between the retrieval process itself, and proof of ownership?
01:13 🔗 Slagfart wouldn't seeding valid data to other users effectively be the same as non-ransomability?
01:14 🔗 closure no, because it's easy to automate detecting "fire at IA, all data lost" and switching strategies to defection
01:14 🔗 Slagfart say you've got a torrent swarm going, and IA burns down. IA should simply reconnect as a peer, and do it quietly.
01:15 🔗 Slagfart really? is it? would require massive collusion
01:15 🔗 closure I suppose it works for lesser disasters, like the wrong set of drives all dying
01:16 🔗 zx (sid17829@[redacted]) has joined #internetarchive.bak
01:17 🔗 Slagfart if anyone can join and contribute, you can inherently make an assumption that a % of users are honest. I don't think a ransomer is going to bother, because their payoff is so uncertain
01:18 🔗 jbenet so, unless you're doing constant non-outsourceability proofs, all the nodes could pool their storage and store 1 replica. all non-honest nodes (i.e. epsilon rational nodes) would.
01:18 🔗 Slagfart let's assume 50% of users are ransomers, and 50% of the remainder are genuine but are unreliable. you've still only got the need for 4 copies out there to cover for that
01:19 🔗 jbenet so the total storage % that the honest nodes control matters a lot without NOSP (non-outsourceable storage proofs)
01:19 🔗 Slagfart doesn't it only matter in relation to the total pool?
01:20 🔗 jbenet right, it's the % of the total addressable storage
01:20 🔗 Slagfart if it's 10% honest nodes, but the pool capacity is 2000% of what's required, you've still got two points of failure
01:20 🔗 jbenet cause the honest nodes better distribute multiple replicas between them
01:21 🔗 Slagfart if the pool capacity doubles (hard drives and bandwidth double in size overnight), you only need 5% honest nodes
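
Written out, the replica arithmetic goes like this: if a fraction p of holders is both honest and reliable, the chance that none of k independent copies survives is (1-p)^k, so the copies needed for a target survival probability follow directly. With the figures above (p = 0.25), four copies give only about 68%; the numbers are the discussion's guesses, not measurements.

    import math

    def copies_needed(p_good: float, target: float) -> int:
        # smallest k with 1 - (1 - p_good)**k >= target
        return math.ceil(math.log(1 - target) / math.log(1 - p_good))

    p = 0.25                         # 50% ransomers, half the rest unreliable
    print(1 - (1 - p) ** 4)          # ~0.68 survival with 4 copies
    print(copies_needed(p, 0.99))    # 17 copies for 99% survival
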
01:21 🔗 Slagfart that honest node problem has already been solved using bittorrent, via hash trees
01:21 🔗 jbenet yeah, also in the pool world, sub-nodes would defect, so the ransom total is not infinity
01:22 🔗 jbenet (or i think-- i havent mathed)
01:22 🔗 Slagfart my understanding is, you can't fake large datasets against an arbitrary hash if the hash is modern enough (eg SHA2) without computing resources that would drastically exceed the potential payout
01:24 🔗 jbenet that depends entirely on "the potential payout"-- how much are people willing to pay for the wealth of humanity's knowledge?
01:24 🔗 Slagfart i reckon just torrent it imho guys hey
01:24 🔗 Slagfart jbenet - I think it's very low! it's already available for free
01:24 🔗 jbenet not if it's the only copy left
01:24 🔗 jbenet the value increases dramatically
01:25 🔗 Slagfart also let's not kid ourselves - this isn't the wealth of humanity's knowledge, this is largely old Geocities sites
01:25 🔗 jbenet anyway i tend to agree- simple seeding will likely work in practice (hence why ipfs doesnt include any proofs-of-retrievability)
01:25 🔗 jbenet hahahaha
01:25 🔗 garyrh heh not really.
01:25 🔗 Slagfart I agree jbenet, but if you create this, and you see 10 seeds on the torrent swarm, I think you could rest easy
01:26 🔗 Slagfart :)
01:26 🔗 jbenet ok, i need to change locations -- archiveteam, will post a proposal for you soon, would be great to work together on this, you too closure :) <3
01:29 🔗 Slagfart has anyone worked out how to zip up 20 petabytes?
01:30 🔗 garyrh Very slowly.
01:31 🔗 Slagfart you've got a method to keep track of what's in each zip?
01:31 🔗 closure Slagfart: I think you'll find you get a torrent file of some truly amazing size (like 1 terabyte) and/or your bittorrent client runs out of ram and/or each chunk is so many gigabytes in size that it turns out to be nearly impossible to complete any chunk
01:31 🔗 godane has quit (Quit: Leaving.)
01:32 🔗 Slagfart oh I agree - but the main page already has the proposal to split it up into 42k 500GB chunks
01:32 🔗 Slagfart each one would be a different torrent. the piratebay for example hosts much more than this, and any given torrent can be reliably downloaded :)
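
The chunk count being tossed around is straightforward division (sizes as discussed in the channel):

    total = 21 * 10**15      # ~21 PB, the figure used in this discussion
    chunk = 500 * 10**9      # 500 GB per torrent
    print(total // chunk)    # 42000 torrents
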
01:33 🔗 Slagfart there's literally a financial disincentive to seed on the piratebay, but people keep doing it. I think the altruism will be a big factor - why do you guys get donations every month?
01:38 🔗 Lord_Nigh (~Lord_Nigh@[redacted]) has joined #internetarchive.bak
01:38 🔗 Lord_Nigh hi all
01:39 🔗 closure Slagfart: I think that could work, and it has a virtue of simplicity (aside from the whole bad actor issue)
01:40 🔗 knytt has quit (Quit: Leaving)
01:41 🔗 Slagfart I'm changing locations too - http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork
01:41 🔗 Slagfart :)
01:41 🔗 Slagfart will be back later - interesting discussion! cheers all
01:42 🔗 closure Slagfart: here, I've written it up http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/torrents_implementation
01:44 🔗 Lord_Nigh is prometheus from ipfs in here?
01:44 🔗 Lord_Nigh https://news.ycombinator.com/item?id=9148576
01:45 🔗 Lord_Nigh as for integrity validation of 500gb blocks (assuming that's how the system ends up getting done) wouldn't it make the most sense to pack a block into 475gb, pack on 25gb of parity/ecc data, and sign the whole damn thing with an IA master key?
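
One way to read that layout, sketched with stand-ins: 19 data stripes plus one XOR parity stripe is exactly the 475:25 ratio, and XOR parity can rebuild any single lost stripe. A real build would more likely use Reed-Solomon/PAR2 recovery data; the ed25519 signing via PyNaCl is a real API, everything else here is illustrative.

    # Sketch of a sealed 500 GB block: 19 data stripes + 1 parity stripe,
    # with a signed manifest of stripe hashes (the "IA master key" assumed).
    import hashlib
    from nacl.signing import SigningKey   # PyNaCl

    N_STRIPES = 19                        # 475 GB data : 25 GB parity

    def xor_parity(stripes: list[bytes]) -> bytes:
        p = bytearray(len(stripes[0]))
        for s in stripes:
            for i, b in enumerate(s):
                p[i] ^= b
        return bytes(p)

    def seal_block(data: bytes, master_key: SigningKey):
        stripe = -(-len(data) // N_STRIPES)              # ceil division
        stripes = [data[i * stripe:(i + 1) * stripe].ljust(stripe, b'\0')
                   for i in range(N_STRIPES)]
        parity = xor_parity(stripes)
        manifest = b''.join(hashlib.sha256(s).digest()
                            for s in stripes + [parity])
        return stripes, parity, master_key.sign(manifest)

A downloader re-hashes every stripe, checks the manifest signature with the archive's public key, and XORs the remaining stripes to rebuild one bad or missing stripe.
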
01:45 🔗 garyrh I think that's jbenet.
01:46 🔗 Slagfart has quit (Ping timeout: 240 seconds)
01:51 🔗 jbenet Lord_Nigh yeah that's me
01:54 🔗 closure oh, top of HN right now, I see
01:54 🔗 closure I've done some proposal gardening on the wiki page
01:58 🔗 closure hmm, the geocities torrent was 1 tb, and it strained some torrent stuff and had trouble getting well seeded.
02:00 🔗 mike` (~mike@[redacted]) has joined #internetarchive.bak
02:05 🔗 closure the geocities torrent currently has 1 leech and 0 seeds :(
02:05 🔗 closure and that's the 641 gb fixed version
02:09 🔗 fenn what specifically went wrong with the 1TB version?
02:21 🔗 fenn d'eaux https://thepiratebay.se/torrent/6350414/Geocities_-_The_PATCHED_Torrent 404's
02:21 🔗 closure https://thepiratebay.se/torrent/6353395/Geocities_-_The_PATCHED_Torrent
02:37 🔗 S[h]O[r]T piece size was too big
02:50 🔗 Control-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
02:56 🔗 Ctrl-S has quit (Read error: Connection reset by peer)
02:56 🔗 Control-S is now known as Ctrl-S
02:56 🔗 godane (~slacker@[redacted]) has joined #internetarchive.bak
02:57 🔗 svchfoo2 gives channel operator status to godane
04:37 🔗 closure well, the torrents idea is looking a little less likely.. how do all those TB of torrents ever get seeded to start?
04:40 🔗 ivan` uTorrent and others support webseeds, which grab over HTTP
04:41 🔗 closure yes, but then you need a file, up for http
04:41 🔗 db48x which IA already does
04:41 🔗 closure maybe the IA could seed some fraction of all the files at a time, not all of them. Would need additional PB of storage
04:42 🔗 closure not in 500 gb collections of items, it doesn't
04:42 🔗 closure see page
04:43 🔗 db48x oh, yea
04:44 🔗 db48x where you split the archive up into uniform-sized chunks, each with a torrent
04:44 🔗 closure right
04:44 🔗 db48x I don't see why that'd be necessary though; IA already has a torrent for every item
04:44 🔗 closure which are not exactly all getting lots of seeds
04:45 🔗 db48x https://ia700800.us.archive.org/17/items/ZztByEpicMegagames/ZztByEpicMegagames_archive.torrent
04:45 🔗 db48x yes, thus my suggestion of a custom BitTorrent client
04:45 🔗 closure it's hard to get 28 million torrents seeded I think
04:45 🔗 xmc closure: hmmm, true.
04:45 🔗 db48x they don't have seeds because there are too many to join any fraction of them
04:46 🔗 db48x most users would have to manually click on a thousand torrent links
04:46 🔗 closure well, if there are 10 thousand users, and you want 10 copies of every file, that bittorrent client would need to load up 28 thousand torrents. Is that doable? I know some people run a lot of torrents, but..
04:47 🔗 db48x if instead they could download a client and tell it that they'd like it to use 100GB of space, and that they like jazz, then it could go join a bunch of torrent swarms automatically
04:47 🔗 closure s/bunch/metric fuckton/
04:47 🔗 db48x that's a point, yes
04:49 🔗 db48x I hadn't considered the constant overhead of the swarm talking to itself; it's a good point
04:49 🔗 closure assume each torrent takes I dunno, 100 kb of ram. That's 3 gb of ram used by the torrent client for 28k torrents
04:49 🔗 db48x it probably dies out once the swarm stabilizes and has no peers, then picks up again when there is a new peer
04:50 🔗 db48x 3gb of address space; it could be swapped out
04:50 🔗 db48x it won't matter if it takes the client a few ms to swap it back and to answer a query about what pieces it has
04:50 🔗 closure you have to track chunks, peers, etc, etc, I don't have numbers, but 100 kb ram seems ballpark
04:51 🔗 db48x agreed
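
closure's ballpark, written out (every input is a guess from the discussion):

    torrents = 28_000_000          # one torrent per item, figure used above
    copies   = 10
    users    = 10_000
    per_user = torrents * copies // users
    ram      = per_user * 100 * 1024        # ~100 KB of client state each
    print(per_user, ram / 2**30)            # 28000 torrents, ~2.7 GiB of RAM
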
04:52 🔗 closure also the whole tracker side.. anyone remember how many items are in TPB?
04:52 🔗 db48x probably varies on chunk size and the number of chunks, which is quite variable for IA items
04:53 🔗 db48x https://news.ycombinator.com/item?id=9149262
04:53 🔗 db48x https://torrentfreak.com/download-a-copy-of-the-pirate-bay-its-only-90-mb-120209/ says 1.6 million in 2012
04:54 🔗 closure that comment is wrong, you still need trackers even if using magnet links
04:55 🔗 db48x yea
04:55 🔗 db48x just a quick source of numbers
04:56 🔗 closure yeah, sounds like there might be tracker software that can handle XX million torrents
04:56 🔗 closure pretty amazing
04:57 🔗 db48x yea, impressive
04:57 🔗 db48x and memory gets cheaper all the time :)
04:57 🔗 db48x btw, use ~~~~ on the wiki to insert a signature
04:57 🔗 closure I suppose trackers could shard pretty well among machines
04:58 🔗 WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
05:07 🔗 cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
05:09 🔗 cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
05:09 🔗 cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
05:42 🔗 X-Scale has quit (Ping timeout: 240 seconds)
06:00 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
06:05 🔗 dyln (~user@[redacted]) has joined #internetarchive.bak
06:06 🔗 bzc6p has quit (Ping timeout: 601 seconds)
06:25 🔗 SketchCow Damn, things got crazy
06:26 🔗 SketchCow So.
06:26 🔗 SketchCow 1. I really want to go forward with Closure's solution. It's by far my favorite.
06:27 🔗 SketchCow 2. I am not against the other two proposals. If they want to initiate tests and the same methods of incrementally doing tests, go for it. Having families and methods is nice.
06:27 🔗 SketchCow 3. That said, I think torrents is insane
06:27 🔗 db48x the whole idea is insane :P
06:31 🔗 SketchCow Yes
06:31 🔗 SketchCow I also think ipfs is interesting, but different than this.
06:32 🔗 SketchCow (Now that I read it)
06:32 🔗 db48x but torrents is probably the insaner of the two
06:32 🔗 SketchCow So, I will push more for closure.
06:32 🔗 SketchCow Of the three, there's three
06:32 🔗 db48x of the N
06:32 🔗 SketchCow So, I sent our internal guy on the Census.
06:33 🔗 SketchCow 14,926,080 items in the database.
06:33 🔗 SketchCow (Not darked, not in some way private, not in some way weird infrastructure in the database)
06:33 🔗 SketchCow Next, he's building a full mined list of these items.
06:33 🔗 SketchCow I mentioned this to Brewster
06:34 🔗 SketchCow Brewster suggests Prelinger be one of the collections.
06:35 🔗 DFJustin I think having 'photogenic' collections at least at first will probably help volunteer uptake / virality
06:35 🔗 SketchCow Yes
07:13 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
07:18 🔗 dyln has quit (Read error: Operation timed out)
07:50 🔗 pg (webchat@[redacted]) has joined #internetarchive.bak
07:51 🔗 pg has quit (Client Quit)
08:16 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
08:31 🔗 edward_ https://news.ycombinator.com/item?id=9147719
08:37 🔗 xmc you
08:37 🔗 xmc so it is you we have to blame for being paid attention to
08:44 🔗 Sanqui that's kind of early attention
09:02 🔗 inversech (~smuxi@[redacted]) has joined #internetarchive.bak
09:07 🔗 midas (~midas@[redacted]) has joined #internetarchive.bak
09:13 🔗 yipdw speaking of being photogenic, since I'm already poking around IA's statusboard, might as well also look into how tied it is to showing book scans
09:14 🔗 yipdw if there's a way to adapt its presentation style to show any item preview that might be fun
09:19 🔗 hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
09:20 🔗 hatseflat hi everyone
09:47 🔗 aschmitz has quit (Ping timeout: 265 seconds)
09:56 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
09:58 🔗 aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
10:58 🔗 Sanqui yipdw: it would be really cool if you could show thumbnails of the images, snippets of the text, samples of the audio, etc. inside the archives
11:09 🔗 Cameron_D (~cameron@[redacted]) has joined #internetarchive.bak
11:25 🔗 zx (sid17829@[redacted]) has left #internetarchive.bak
13:01 🔗 godane has quit (Quit: Leaving.)
13:13 🔗 trs80 (trs80@[redacted]) has joined #internetarchive.bak
13:14 🔗 svchfoo1 gives channel operator status to trs80
13:14 🔗 svchfoo2 gives channel operator status to trs80
13:15 🔗 nicoo (~nico@[redacted]) has joined #internetarchive.bak
13:26 🔗 godane (~slacker@[redacted]) has joined #internetarchive.bak
13:27 🔗 svchfoo2 gives channel operator status to godane
13:36 🔗 ams (webchat@[redacted]) has joined #internetarchive.bak
13:43 🔗 phuzion (~phuzion@[redacted]) has joined #internetarchive.bak
13:52 🔗 cf_ Hi all. Just want to toss out my 2 cents: 1) I don't think the pure torrents solution will work - there are just too many intermediary steps to repack the data and too much to keep track of afterwards. It's possible and the easiest solution upfront, but in the long run I don't think it'll work out too well.
13:52 🔗 cf_ 2) IPFS is really neat, but just isn't ready yet. I've been running some tests on a VPS with one of our twitch megawarcs and it's just not ready for the level of use that we would be forcing on it. I also haven't seen a way to easily determine exactly which other nodes should receive a given file - i.e. it seems like you just push the file out and the network decides who gets copies of it.
13:52 🔗 cf_ I say this only because if we're going to be pushing an extraordinary amount of pretty important data on to the network (and I think 21PB qualifies for this), I feel like we need to be able to control it so that it only goes onto machines owned by people who are willing to keep up a certain level of uptime, throughput, etc.
13:52 🔗 cf_ 3) that leaves us with git-annex, which I think is the best solution. The dev is part of the team, the design proposal makes it look dead simple to use and implement a solution with, and the only issue that I see is the file limit, but (as discussed) that's easily fixed with shards. Anyways, that's just my POV on this.
13:54 🔗 ams has quit (Ping timeout: 240 seconds)
14:00 🔗 VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
14:03 🔗 raylee (~raylee@[redacted]) has joined #internetarchive.bak
14:18 🔗 closure SketchCow: it's funny, torrents is my favorite solution, if seeding can be solved :)
14:19 🔗 closure one seeding solution is to just start with 500 gb torrent 1, get it seeded to enough people that we trust it will live, delete it from our server, and move on to 500 gb torrent 2
14:20 🔗 closure and then if torrent 1 becomes unhealthy, we either a) recreate it from IA Items and get it re-seeded, or b) find we cannot do that anymore (eg, some Items went dark or were modified) and stop offering torrent 1, instead offering torrent 1B which contains all the items that were in torrent 1
14:20 🔗 closure seems pretty doable
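
A sketch of that rotation as a loop; tracker, ia, and the health threshold are stand-ins for whatever scrape/trust checks would actually be used:

    import time

    MIN_SEEDS = 10   # "enough people we trust it will live" -- a guess

    def seed_in_rotation(chunks, tracker, ia):
        for chunk in chunks:                       # 500 GB torrent 1, 2, ...
            ia.seed(chunk)
            while tracker.seed_count(chunk) < MIN_SEEDS:
                time.sleep(600)                    # poll the tracker
            ia.delete_local(chunk)                 # free space, move on

    def repair(chunk, tracker, ia):
        if tracker.seed_count(chunk) >= MIN_SEEDS:
            return                                 # still healthy
        if ia.items_unchanged(chunk):
            ia.seed(chunk)                         # (a) recreate from Items, re-seed
        else:
            ia.publish(ia.recut(chunk))            # (b) retire it, offer "1B"
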
14:21 🔗 ersi cf_: Pro-tip: This is IRC, if you paste long text - it's going to get truncated
14:21 🔗 Sanqui not a fan of the torrent solution
14:21 🔗 ersi I'm a huge fan
14:22 🔗 ersi blows some hot air around
14:22 🔗 closure it's easy, and I don't have to do any of the work >> fan
14:22 🔗 Sanqui it would be cool if distribution was.. distributed a la torrents
14:22 🔗 Sanqui but torrents are unfriendly to splitting and cold storage
14:22 🔗 closure this is true
14:23 🔗 closure git-annex wins there
14:23 🔗 Sanqui would it be possible for git-annex to be peer to peer?
14:24 🔗 closure if the peers have some way of being introduced and communicating, yes
14:25 🔗 Sanqui that's the only advantage I saw in torrents (besides simplicity of implementation)
14:26 🔗 arkiver kut
14:26 🔗 arkiver oops
14:33 🔗 closure is adding ipfs support to git-annex this morning, BTW
14:33 🔗 closure :)
14:54 🔗 phuzion What about Tahoe-LAFS?
14:55 🔗 phuzion I dunno how well it would handle 21PB, but if it handles it well, I think it could certainly be a contender for a storage solution.
15:14 🔗 csssuf (~csssuf@[redacted]) has joined #internetarchive.bak
15:26 🔗 Start has quit (Disconnected.)
16:02 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
16:09 🔗 Start has quit (Read error: Connection reset by peer)
16:09 🔗 Start_ (~Start@[redacted]) has joined #internetarchive.bak
16:20 🔗 enkiv2 (~john@[redacted]) has joined #internetarchive.bak
16:24 🔗 nicoo has quit (Ping timeout: 260 seconds)
16:52 🔗 Start_ has quit (Disconnected.)
17:01 🔗 nicoo (~nico@[redacted]) has joined #internetarchive.bak
17:10 🔗 wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
17:32 🔗 midas lol arkiver :p
17:32 🔗 arkiver :/
17:32 🔗 midas the kut moment
17:32 🔗 arkiver was working on something, didn't work, typed in wrong chat :P
17:33 🔗 midas :p
17:39 🔗 everdred (~irssi@[redacted]) has left #internetarchive.bak
17:49 🔗 WubTheCap has quit (Quit: Restart)
17:50 🔗 WubTheCap (~wub@[redacted]) has joined #internetarchive.bak
17:53 🔗 ersi kut
18:01 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:13 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
18:20 🔗 destrudo (~destrudo@[redacted]) has joined #internetarchive.bak
18:20 🔗 shabble (~shabble@[redacted]) has joined #internetarchive.bak
18:21 🔗 mianaai (~user@[redacted]) has joined #internetarchive.bak
18:21 🔗 mianaai hi
18:25 🔗 ersi Hi.
18:27 🔗 bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
18:31 🔗 bzc6p_ has quit (Read error: Operation timed out)
18:42 🔗 Start has quit (Disconnected.)
18:57 🔗 closure is now known as joeyh
18:57 🔗 bzc6p__ is now known as bzc6p
19:22 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
19:53 🔗 VADemon has quit (Read error: Connection reset by peer)
20:10 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:12 🔗 X-Scale has quit (Ping timeout: 240 seconds)
20:14 🔗 jbenet SketchCow, cf_, joeyh: yeah ipfs is not perfectly ready yet, but please note we're moving really fast. 6mo ago nothing worked. i think we can get the reliability you need reasonably quickly-- and we are eager to help get there (we have to do it anyway). these projects tend to last _years_. maybe a good thing to do is this: give us a measure of the reliability you need and we can optimize for it-- i.e. point us to a specific sample workload (say "replicate this specific 1TB archive") and we can race to it.
20:14 🔗 jbenet -- you can keep doing what you're doing and if we make the needed progress reasonably quickly you can re-eval then. ( also, joeyh: if we can get your help/advice along the way we can probably move even faster. )
20:14 🔗 jbenet (reasonably quickly measured in weeks)
20:15 🔗 jbenet also-- if i were you, i would build a layer of indirection between any backend system, so you can upgrade/not forcibly depend on anything over time. these efforts last _years_.
20:19 🔗 joeyh +1 layer of indirection
20:20 🔗 joeyh jbenet: so, looking at the ipfs data model and mapping it onto this, I imagine something like:
20:20 🔗 joeyh 1. IA adds each of their items to ipfs, gets the ipfs address for it
20:20 🔗 joeyh 2. users can then download the items into their own ipfs nodes
20:21 🔗 joeyh 3. then we need some way for users to communicate (or better, prove, but..) that they are backing up a given item
20:22 🔗 joeyh #3 could be done by the user setting up their own ipns namespace, and publishing a list of the items that have there, I suppose
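
Those three steps map onto stock go-ipfs CLI commands; driven from Python it might look like the sketch below. ipfs add, ipfs pin, and ipfs name publish are real subcommands, but the item path, exact flag behavior per version, and the pin-list file are assumptions. As joeyh notes, step 3 only announces possession, it doesn't prove it.

    import subprocess, tempfile

    def ipfs(*args) -> str:
        return subprocess.check_output(('ipfs',) + args, text=True).strip()

    # 1. IA adds an item, getting its address (-Q prints only the root hash)
    item_hash = ipfs('add', '-r', '-Q', '/path/to/item')   # placeholder path

    # 2. a user mirrors that item onto their own node
    ipfs('pin', 'add', item_hash)

    # 3. the user publishes the list of items they hold under their ipns name
    pins = ipfs('pin', 'ls', '--type=recursive')
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
        f.write(pins)
    ipfs('name', 'publish', ipfs('add', '-Q', f.name))
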
20:23 🔗 mike` (~mike@[redacted]) has left #internetarchive.bak
20:24 🔗 joeyh #1 is somewhat problematic to do without using vast amounts of disk space at the IA for ipfs
20:24 🔗 jbenet joeyh: yeah exactly. ipfs can also be used as a lib-- currently in Go, but can easily make a special binary with your own protocol that can do more sophisticated things if you need them.
20:24 🔗 jbenet my guess is any stock ipfs client can do what you need, but the power is there if you need it
20:25 🔗 joeyh what seems to remain is scalability, cf my experience last night with an OOM kill of ipfs when downloading a few hundred MB
20:25 🔗 jbenet on #1 one possibility is to make ipfs use an index on existing fs-- this is certainly not a good model for the average user, but for dedicated installations and PB of data, you dont want to throw it all into leveldb ;P
20:25 🔗 joeyh ah, that would be great for #1
20:25 🔗 jbenet joeyh: what is the size of the machine?
20:26 🔗 joeyh that machine has 2 gb of ram, probably well over 1 gb free
20:26 🔗 jbenet joeyh: i havent seen OOM for a long time-- it may be something about getting into a weird state
20:27 🔗 jbenet (i've been booting GB vms this week)
20:27 🔗 Start has quit (Disconnected.)
20:28 🔗 jbenet joeyh: can you repro reliably? would be awesome to get a stack trace (can kill a go proc with ctrl+\ anytime)
20:28 🔗 joeyh largest file in the IA is apparently 2 tb, for reference
20:28 🔗 jbenet (( anyway would love to sink deeper into this one test run-- but also dont want to pollute this channel -- either way :) ))
20:28 🔗 joeyh ok, if that's an unexpected bug, I'll try to repro it
20:29 🔗 joeyh but then it curves right down to 100 gb or so files
20:29 🔗 joeyh of course, there's also the question of scaling to a great many items
20:31 🔗 jbenet joeyh: awesome, is there a text file with all these sizes? how many 50-200GB+ files? -- i could generate random data of similar sizes and treat that as a test suite to go for
20:31 🔗 joeyh apparently the IA is working on getting us a full list of Items
20:31 🔗 jbenet joeyh: yep, i think the sanest thing is to shard right now (can use an index of ipfs objects themselves and tell different sets of nodes to `ipfs pin -r` different subsets)
20:31 🔗 jbenet sweet!
20:32 🔗 joeyh my impression was not many > 100 gb
20:32 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
20:32 🔗 joeyh and, 18 million items could easily be another order of magnitude more files
20:32 🔗 xmc yeah. for a long time IA policy was to strive to keep items under 10G each
20:32 🔗 xmc archiveteam blew right through that
20:33 🔗 xmc 10G because items can't be split across physical volumes, in current design
20:34 🔗 jbenet xmc: why not? cant split items into subblocks?
20:34 🔗 yipdw you'll find a couple hundred 50 GB items in the archiveteam collections
20:34 🔗 xmc jbenet: i think they treat an item as a directory in a real unix filesystem
20:34 🔗 joeyh jbenet: I wonder about dht scalability, etc to so many objects though
20:34 🔗 jbenet btw joeyh: we can make arbitrary file chunking datastructures, right now we use the simplest thing possible but if theres a file chunking / index datastructure that optimizes the IA use we can do that
20:36 🔗 jbenet joeyh: hmm-- dhts scale pretty well. if the use case foresees only accessing whole files (mdag roots), we can even run a separate dht that only advertises the roots.
20:36 🔗 jbenet there's other solutions being discussed, the good thing is that it's ~100loc to try something else.
20:36 🔗 xmc IA devs are traditionally fans of simple, popular things (tar, jpeg, txt)
20:37 🔗 xmc if that helps with guiding design
20:37 🔗 xmc but hey, working code works
20:39 🔗 jbenet xmc: yeah, makes sense -- ipfs splits large unix files into sub-blocks, (think of how a unix fs works underneath the hood) -- the indexing datastructure is pluggable, so you can use -- say, the ext4 indirect block layout, or something else depending on the use case. probably doesnt matter here at all-- just something that we have because it really changes
20:39 🔗 jbenet how well certain use cases (like video streaming of massive files) can perform
20:39 🔗 xmc *nod*
21:22 🔗 Start has quit (Disconnected.)
21:28 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
21:59 🔗 edward_ has quit (Ping timeout: 512 seconds)
22:19 🔗 Start has quit (Disconnected.)
23:06 🔗 mianaai` (~user@[redacted]) has joined #internetarchive.bak
23:08 🔗 mianaai has quit (Read error: Operation timed out)
23:09 🔗 dirt_ (james@[redacted]) has joined #internetarchive.bak
23:09 🔗 dirt has quit (Ping timeout: 258 seconds)
23:09 🔗 dirt_ is now known as dirt
23:09 🔗 DFJustin has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 db48x has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 ersi has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 jake1 has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 mianaai` has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 midas has quit (hub.efnet.us irc.Prison.NET)
23:09 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
23:14 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:15 🔗 DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
23:15 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
23:15 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
23:15 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
23:15 🔗 irc.Prison.NET gives channel operator status to db48x DFJustin
23:15 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
23:15 🔗 mianaai` (~user@[redacted]) has joined #internetarchive.bak
23:15 🔗 midas (~midas@[redacted]) has joined #internetarchive.bak
23:15 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
23:24 🔗 Start has quit (Disconnected.)
23:25 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:30 🔗 mianaai`` (~user@[redacted]) has joined #internetarchive.bak
23:31 🔗 Start has quit (Ping timeout: 370 seconds)
23:33 🔗 ersi has quit (Ping timeout: 258 seconds)
23:33 🔗 ersi (~ersi@[redacted]) has joined #internetarchive.bak
23:33 🔗 mianaai` has quit (Read error: Operation timed out)
23:36 🔗 cf_ (~nickgrego@[redacted]) has left #internetarchive.bak
23:38 🔗 SketchCow hi
23:39 🔗 SketchCow I am digging my car iut.
23:39 🔗 SketchCow oit.
23:39 🔗 SketchCow snow
23:39 🔗 SketchCow tiny phone keyboard.
23:39 🔗 SketchCow anyway. I will be aroubd tonight.
23:42 🔗 DFJustin has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 db48x has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 jake1 has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 mianaai`` has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 midas has quit (hub.efnet.us irc.Prison.NET)
23:42 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
23:48 🔗 DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
23:48 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
23:48 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
23:48 🔗 irc.Prison.NET gives channel operator status to db48x DFJustin
23:48 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
23:48 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
23:49 🔗 db48x has quit (Ping timeout: 258 seconds)
23:50 🔗 midas1 (~midas@[redacted]) has joined #internetarchive.bak
