#internetarchive.bak 2015-03-04,Wed


01:02 🔗 garyrh_ gives channel operator status to closure garyrh ivan` Kenshin
01:03 🔗 SketchCow closure: Good job
01:04 🔗 SketchCow I'd like the default shard(?) to be 100gb, or a percentage of a typical small drive
01:04 🔗 SketchCow For the record, I think the amount of stuff will be far under 20 petabytes
01:04 🔗 SketchCow Our internal guy will have some good data on it this week.
01:11 🔗 SketchCow So, with this, the question becomes.... what's missing?
01:11 🔗 SketchCow I mean, a good leaderboard and view, of course.
01:11 🔗 SketchCow But I have some extra disk space, as I'm sure others do, to donate to the cause.
01:40 🔗 closure SketchCow: each shard is split further among clients, so it can be larger than a typical small drive. A client could store only a few gb out of an 8 tb shard
01:41 🔗 SketchCow I see.
01:41 🔗 SketchCow OK, withdrawn.
01:41 🔗 SketchCow I have a range of questions, if you want them.
01:41 🔗 closure it's basically first come, first served as to which clients get which items out of a shard
01:41 🔗 closure yeah, ask away
01:41 🔗 SketchCow (Also, I'll redo the talk page to reflect a pushing to git-annex)
01:43 🔗 SketchCow Goofy McAnderson has a drive dedicated to us. It's on his system, it's 500gb.
01:43 🔗 SketchCow If he was to look in that drive's directory, what would he see?
01:43 🔗 closure bunch of $itemname.tar
01:44 🔗 closure some random or not so random subset of the IA items
01:44 🔗 SketchCow Rounded to item?
01:44 🔗 SketchCow So $itemname.tar is the full originals set of $itemname?
01:44 🔗 closure yeah, presumably w/o the derives
01:53 🔗 BEGIN LOGGING AT Tue Mar 3 20:53:14 2015
01:53 🔗 Now talking on #internetarchive.bak
01:53 🔗 Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
01:53 🔗 Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Sun Mar 1 23:43:37 2015
01:53 🔗 svchfoo1 gives channel operator status to chfoo
01:54 🔗 pikhq That's a pretty neat approach to git annex for it, yeah.
01:54 🔗 pikhq (apologies for not having been around much otherwise: job interviewing. :))
01:54 🔗 closure kind of a cool effect of distributed, forked git repos
01:55 🔗 closure or, he could rename it to "awesomehot.tar", and the IA wouldn't care; it can still restore the file from him despite the name change
01:57 🔗 yipdw does that capability fall out from the usual way git handles renames?
01:57 🔗 yipdw or is there more on top from git-annex
01:57 🔗 pikhq That's more a function of git-annex's storage of data.
01:57 🔗 closure it's basically due to git renames, yes
01:58 🔗 yipdw hmm neat
01:58 🔗 pikhq All that's in the git repo *itself* is just a symlink to the git-annex data store, but git annex doesn't really look at the symlinks to determine what the file is.
01:58 🔗 pikhq Yaaay, nice properties falling out naturally.
01:58 🔗 closure pikhq is right
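The rename-independence pikhq and closure describe comes from git-annex's content addressing: the working tree holds only a symlink whose target encodes a key derived from the file's bytes, so the symlink's own name carries no meaning. A minimal Python sketch of the idea, assuming git-annex's documented SHA256E key format (the key-building code here is illustrative, not git-annex's own):

```python
import hashlib
import os

def annex_key(path):
    """Build a SHA256E-style key: content size and hash plus extension.
    Nothing about the file's *name* enters the key."""
    with open(path, "rb") as f:
        data = f.read()
    ext = os.path.splitext(path)[1]  # e.g. ".tar"
    return "SHA256E-s%d--%s%s" % (len(data), hashlib.sha256(data).hexdigest(), ext)

# "$itemname.tar" renamed to "awesomehot.tar" has the same bytes, so it
# maps to the same key, and the IA can still locate and restore the
# content from the client despite the rename.
```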
02:00 🔗 SketchCow I like that solution (nooo, hotbootydogporn)
02:04 🔗 SketchCow These are all good.
02:04 🔗 SketchCow I am running out of questions
02:04 🔗 SketchCow Oh, and this can go into the wiki or your document
02:04 🔗 SketchCow What is IA running?
02:05 🔗 SketchCow Like, what do we need to be running? Another machine with git, git-annex, or whatever?
02:05 🔗 SketchCow I mean, it sounds almost like we need to give you a box and let you start making it into god.
02:06 🔗 closure IA needs some kind of server, with git and git-annex. I'm assuming locked-down ssh keys for clients to access it (they can only run git-annex-shell to download data and do git pull/push), although that could be handled other ways.
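The usual mechanism for that lock-down is an OpenSSH forced command: each client's public key is pinned to git-annex-shell, which only understands the transfer and git operations closure lists. A hedged sketch generating such an authorized_keys line; the exact option set is an assumption modeled on git-annex's documented forced-command setup, not a tested IA config:

```python
def authorized_keys_line(pubkey, repo_dir):
    """Emit one authorized_keys entry that forces git-annex-shell.

    GIT_ANNEX_SHELL_DIRECTORY restricts the client to a single repo;
    the no-* options shut off tunnels and ttys.
    """
    opts = ",".join([
        'command="GIT_ANNEX_SHELL_DIRECTORY=%s git-annex-shell -c \\"$SSH_ORIGINAL_COMMAND\\""'
        % repo_dir,
        "no-agent-forwarding",
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-pty",
    ])
    return "%s %s" % (opts, pubkey)

# Hypothetical key and repo path, for illustration only:
print(authorized_keys_line("ssh-ed25519 AAAA... client@example", "/srv/shard42.git"))
```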
02:06 🔗 SketchCow How much disk space
02:06 🔗 SketchCow And is it shoving out all the data?
02:06 🔗 SketchCow Is it pulling from the items and constructing the love?
02:07 🔗 closure needs enough disk to buffer outgoing transfers to clients, and probably several gb for the git repos
02:07 🔗 closure I assume it's doing the client-facing transfer, I don't know about how the $item.tar gets made
02:09 🔗 closure could be one client per shard too, or something like that, depending on how some of these things scale
02:09 🔗 closure sorry, 1 server per shard
02:09 🔗 closure or per 10 or whatever
02:10 🔗 SketchCow Yeah, might want to use AWS
02:10 🔗 closure hmm, here's one other thought.. the total number of files in all items in IA might be only say 10x the number of items. It might make sense to make the repos contain not $item.tar, but $item/$file
02:11 🔗 SketchCow It's a strong idea.
02:11 🔗 closure it lets you play mp3s and movies w/o this tar thing that is still so hard to use 35 years after being made ;)
02:12 🔗 closure (btw, I have a git-annex repo I made a while ago that contains the most popular 500 or so GD live recordings. Kind of amusing.)
02:13 🔗 closure I pulled out all the recordings of Dark Star. I think I could play them back to back for about 1 week..
02:14 🔗 closure 119 gb
02:14 🔗 closure is only a baby deadhead
02:18 🔗 SketchCow It might be worth doing for the initial.
02:18 🔗 SketchCow And it's sexy, it solves the problem of the IA completely cratering into the earth
02:18 🔗 SketchCow the drives just.... have files
02:18 🔗 SketchCow Nice
02:19 🔗 closure git-annex.branchable.com/future_proofing
02:27 🔗 pikhq Yeah, that's probably my favorite feature of git-annex.
02:28 🔗 pikhq If git annex bites the dust somehow, it's just files.
02:48 🔗 closure updated document with several items
02:57 🔗 Ctrl-S will the files be compressed?
02:58 🔗 closure I was thinking not, but *shrug* could be
02:59 🔗 closure (assuming it uses ssh they'd be compressed in transit)
02:59 🔗 fenn compression saves a lot on html, which is what most of the web archive would be
02:59 🔗 closure point
02:59 🔗 fenn someone said something about bzip
03:00 🔗 closure except, is warc compressed? :)
03:00 🔗 Ctrl-S with the amount of data you guys will be working with, you don't really have the option to not use compression
03:00 🔗 pikhq warc is commonly gzip compressed.
03:00 🔗 pikhq I don't know if the web archive is though.
03:01 🔗 closure if it's separate files, and not $item.tar, it could decide on a per-file basis when adding it whether to compress, or leave an already compressed file format as-is
03:01 🔗 fenn (nevermind, the bzip thing was something else)
03:02 🔗 ivan` how much low-entropy stuff would there be on IA, anyway? .warc.gz, audio, video are relatively incompressible
03:02 🔗 closure pdf, html, disk images, ..
03:03 🔗 fenn there are compression algorithms that uncompress .zip or whatever and then recompress it better, making a note of how to re-zip it exactly when you restore
03:03 🔗 Ctrl-S How expensive (computer time, coder time) would it be to try compressing everything, or to check whether compression would have a benefit?
03:04 🔗 Ctrl-S like if it's a warc, make sure it's a compressed warc, if it's a known video/image/audio file don't bother
03:04 🔗 pikhq Coder time, pretty easy. Compute time, :(
03:05 🔗 Ctrl-S could you just have a script look to see if compression has been done, and then, if it's a known compressible type, apply compression if needed?
03:06 🔗 fenn you could just try compressing the first 1kB and see if it helps or not
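A minimal Python sketch of fenn's sample-first heuristic: deflate the first 1 kB and only compress the whole file if the sample shrinks meaningfully. The extension skip-list and the 10% threshold are assumptions, not settled policy:

```python
import zlib

# Formats that are already compressed; recompressing just burns CPU.
SKIP = (".gz", ".bz2", ".zip", ".jpg", ".png", ".mp3", ".mp4", ".mkv")

def worth_compressing(path, sample_size=1024, min_saving=0.10):
    """Return True if compressing this file looks worthwhile."""
    if path.endswith(SKIP):
        return False
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    compressed = zlib.compress(sample, 6)
    # Require the 1 kB sample to shrink by at least min_saving.
    return len(compressed) < len(sample) * (1.0 - min_saving)
```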
03:07 🔗 aschmitz My question is mostly how much git-annex would trust the clients. For example, if I claim I have the whole archive, does it have any realistic way of checking? Obviously I might need some metadata for that (hashes of each file, or whatever), but far less than actually having everything.
03:07 🔗 yipdw you could also run your shard on a compressed filesystem, take the complexity of compression entirely out of this system
03:08 🔗 aschmitz (See also Sybil attacks on multiple copies, etc.)
03:09 🔗 fenn aschmitz: you ask the client for a salted hash
03:09 🔗 Ctrl-S randomly request a 1M chunk?
03:09 🔗 closure aschmitz: that's a fun attack. :) The fire drill section has one way to detect such bad actors, but it seems hard to know for sure; you have to decide how much you trust people and the system, and hope for enough redundancy...
03:09 🔗 aschmitz fenn: Yeah, I had proposed that before, and it seems like the only realistic way I can think of. Not sure if git-annex does that now, but I have to bet that closure could make it. :)
03:10 🔗 fenn i have yet to hear any good solutions to sybil attacks (in general) and proof of storage
03:10 🔗 pikhq Salted hash of a random 1M chunk would suffice for detecting corruption, but yeah. That's not really a way of determining how much to trust a client but more whether or not a client has violated that trust.
03:10 🔗 closure git-annex doesn't have proof right now, other than trying to get the file that it claims to have
03:11 🔗 aschmitz pikhq: Could do salted hash of the whole N GB chunk, as all you have to transfer is the hash. Would have to read that from the archive too, but whatever disk scrubbing IA does probably has to read everything regularly anyway.
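What fenn, Ctrl-S, and aschmitz are converging on is a simple challenge-response for proof of storage: a fresh random salt each round means a client can't answer from a cached hash; it has to actually hold the chunk's bytes. A sketch, assuming the server still has the chunk (or a precomputed answer) to check against:

```python
import hashlib
import hmac
import os

def make_challenge():
    """Server: pick a fresh salt so old answers can't be replayed."""
    return os.urandom(32)

def respond(salt, chunk):
    """Client: prove possession by hashing salt || chunk."""
    return hashlib.sha256(salt + chunk).hexdigest()

def verify(salt, chunk, answer):
    """Server: recompute from its own copy and compare in constant time."""
    expected = hashlib.sha256(salt + chunk).hexdigest()
    return hmac.compare_digest(expected, answer)
```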
03:12 🔗 closure however, systems that have a tit-for-tat incentive system need proof more than this system, which has an incentive of helping the IA
03:12 🔗 closure (right?)
03:12 🔗 aschmitz fenn: That's a valid point. I suppose if you don't care that everyone is anonymous, you could register different people separately.
03:13 🔗 fenn forcing people to register doesn't solve the sybil attack problem
03:13 🔗 pikhq closure: Yeah, there's no incentives to game here which helps a lot in terms of the odds of being attacked.
03:13 🔗 aschmitz fenn: Depends on how thorough you are at identifying them. Having, say, a number of different universities register is something where you could at least verify that they're distinct. Individuals would be a lot harder.
03:13 🔗 closure well, no incentive other than some random 4chan thread "let's kill the IA today because it's a wednesday"
03:13 🔗 fenn but it may not matter anyway; bittorrent has various enemies and is also vulnerable to sybil attacks, but it still works fine
03:13 🔗 aschmitz closure: That was the concern, yeah.
03:14 🔗 Ctrl-S multiple tiers of trust
03:14 🔗 aschmitz closure: Alternatively, someone could just target a small section of the data (say, furry art or something), claim they had several copies, and if the IA ever does become a crater, nobody else will have bothered to keep copies.
03:15 🔗 fenn how would they "target" the data?
03:15 🔗 aschmitz I don't necessarily have solutions, and git-annex is awesome, just trying to throw out potential issues. They're not all necessarily valid attacks, or worth defending against.
03:16 🔗 closure aschmitz: yeah. It is possible to prevent such targeting, but it adds quite a lot of complexity, and possibly decreases incentives for some good actors
03:16 🔗 aschmitz fenn: Presumably they could "sample" a bunch of different chunks, then identify content they didn't like? WARCs would be pretty easy to identify, say, by domain name by just scanning a few kb of content.
03:16 🔗 closure ie, it could assign particular items at random to clients, and ignore clients who claim to have unassigned items
03:16 🔗 aschmitz Sure.
03:17 🔗 closure or encrypt items..
03:17 🔗 aschmitz Aside: What happened to http://archive.org/about/bibalex_p_r.php ?
03:18 🔗 aschmitz closure: Unfortunately, encryption is a pretty annoying single point of failure, and if the key(s) go away, all the data does.
03:18 🔗 fenn yeah, that
03:19 🔗 closure I actually shard (SSS) my gpg key among many git-annex repos. Saved me from losing it last month :)
03:19 🔗 closure N of M is the bomb
03:19 🔗 aschmitz Nice.
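The "SSS" closure sharded his gpg key with is Shamir secret sharing: a degree-(N-1) polynomial over a prime field whose constant term is the secret, with one point handed to each of M holders. Any N points interpolate the secret; fewer reveal nothing. A toy sketch (the prime and integer encoding are assumptions; a real deployment would use a vetted tool such as ssss):

```python
import random

P = 2**127 - 1  # a Mersenne prime, comfortably larger than a 16-byte secret

def split(secret, n, m):
    """Split an integer secret < P into m shares; any n recover it."""
    rng = random.SystemRandom()
    coeffs = [secret] + [rng.randrange(P) for _ in range(n - 1)]

    def evaluate(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P

    return [(x, evaluate(x)) for x in range(1, m + 1)]

def combine(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    secret = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(42424242, n=3, m=5)
assert combine(shares[:3]) == 42424242  # any 3 of the 5 shares suffice
assert combine(shares[2:]) == 42424242
```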
03:20 🔗 pikhq Frankly probably the best thing for preventing these sorts of attacks is just making sure that enough good actors participate that these won't work. :)
03:20 🔗 aschmitz Yeah, it's not like there's not precedent for doing that with keys (see: DNSSEC root), but I'm guessing we'd prefer to avoid having to deal with it.
03:21 🔗 fenn the most likely attacks are not cryptographic attacks or explosives, but legal actions
03:21 🔗 yipdw yeah, if there was a huge asshole contingent I'd guess we'd have seen it in the warrior projects
03:21 🔗 yipdw haven't seen that so far
03:21 🔗 fenn like "cease and desist at once!"
03:21 🔗 aschmitz yipdw: I'm still impressed you haven't.
03:21 🔗 aschmitz Which, y'know good.
03:22 🔗 yipdw I am impressed too
03:22 🔗 pikhq Not very interesting to assholes.
03:22 🔗 closure yipdw: you forget when we exploited the leaderboard with HTML injection? :)
03:22 🔗 pikhq It's like attacking an orphan's puppy. Just, why?
03:22 🔗 fenn unfortunately you need orders of magnitude more participation (and thus attention, and unwanted attention) than the warrior projects
03:22 🔗 yipdw closure: oh yeah, there was that
03:23 🔗 pikhq fenn: True.
03:23 🔗 closure <-- hey there's always 1 asshole
03:23 🔗 pikhq My warrior instance I haven't paid attention to in months.
03:23 🔗 pikhq Still see it show up on leaderboards though.
03:23 🔗 aschmitz pikhq: Might it be worth allowing some sort of automatic "I have these chunks" messages/something that can be shared among places that trust one another? I suspect many places that support LOCKSS would potentially dedicate some storage space, and be willing to trust one another and avoid duplicating effort unnecessarily.
03:23 🔗 fenn "trust, but verify"
03:24 🔗 fenn if there is a simple protocol to verify then there's no reason to blindly trust
03:24 🔗 aschmitz Well, sure.
03:25 🔗 fenn if N of M shards are needed to reassemble a decryption key, who owns the key, and how do they get it?
03:26 🔗 aschmitz Whoever can get N shards :)
03:26 🔗 fenn isn't this just DRM all over again?
03:26 🔗 fenn (wasn't it proven that DRM can't work?)
03:26 🔗 aschmitz Under such a scheme, you wouldn't actually let people decrypt the data unless the key were revealed (which would only happen when IA disappears), which technically works.
03:27 🔗 fenn how would the key be revealed "when IA disappears" (whatever that means)
03:27 🔗 aschmitz DRM relies on saying "you can see this data, but you have to stop when I tell you to". This would be "you can have this data, but can't decrypt it until I release the key".
03:27 🔗 aschmitz Presumably a number of semi-trusted people would be given shards of the key, and N of M of them would have to agree.
03:28 🔗 fenn also this sounds a lot like various video game quests :P
03:28 🔗 aschmitz Note: I don't particularly like this idea, but I'm explaining how it would work.
03:28 🔗 closure Kill Bills 0..M-N
03:28 🔗 closure me neither, for the record
03:28 🔗 fenn Three were intended for the Elves, Seven for Dwarves, Nine for Men, and one, the One Ring was given to 4chan
03:28 🔗 aschmitz Which is to say: I don't think the crypto is really necessary.
03:29 🔗 aschmitz On the other hand, Freenet seems to avoid some problems by not really letting anyone see what their computer is actually storing. Hopefully that wouldn't be an issue here, but I don't know how many threats IA gets, or how many individual participants would be likely to get.
03:29 🔗 Ctrl-S if you encrypt it you create a single point of failure
03:30 🔗 Ctrl-S if someone controls the keys they can control the whole array
03:30 🔗 aschmitz Ctrl-S: Not that I disagree, but we were at least discussing how to make it a N of M point of failure :)
03:30 🔗 aschmitz And to be fair, the keys would only be used to obscure the data that was being stored, not for commanding the clients or anything.
03:31 🔗 fenn i was thinking a different failure mode... the world blows up and nobody can read the ancient scrolls because they're encrypted
03:31 🔗 aschmitz Yeah, that would also suck.
03:31 🔗 Ctrl-S i was thinking of access to data, not C&C
03:32 🔗 Ctrl-S if the keys are lost, the data is lost
03:32 🔗 Ctrl-S so if you did have keys you'd need to spread them over the world
03:32 🔗 aschmitz Ah, I was confused by your "can control the whole array" comment. Anyway, it doesn't seem like anyone likes the idea, so it doesn't seem worth going over too much.
03:32 🔗 fenn i'm sure this conversation will come up again and again, with all the "dark" data in IA
03:32 🔗 Ctrl-S if you are the only one with the keys, no one can access it without you
03:32 🔗 aschmitz Actually, yeah, dark data might be interesting.
03:33 🔗 Ctrl-S only encrypt dark data?
03:34 🔗 aschmitz Ctrl-S: DNSSEC handled the "one key spread over the world" with their Trusted Community Representatives stuff: http://www.root-dnssec.org/index.html%3Fp=171.html . Ignoring everything else about DNSSEC, it seems like a reasonable proposal if you have to do that sort of thing.
03:34 🔗 fenn there's something called "time lock encryption puzzles" where you basically just square a number repeatedly, and it has to be done in serial fashion, and it takes a lot of processor cycles, but not an unfeasible number of cycles
03:35 🔗 fenn the idea is that someone can decrypt the data after crunching on it for an arbitrarily long time
03:35 🔗 Ctrl-S proposes the well-established and highly secure ROT-13 crypto algorithm
03:35 🔗 aschmitz fenn: I was actually looking into that for similar data, yeah. Unfortunately, you kind of have to leave something running doing the calculations to have the time lock expire at the right time, but I guess that's not a huge deal.
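The construction fenn and aschmitz describe is the Rivest-Shamir-Wagner time-lock puzzle: the key is masked with 2^(2^t) mod n, which the puzzle maker computes cheaply via phi(n), but anyone without the factors of n must grind through t inherently serial squarings. A toy sketch with deliberately tiny primes and a small t; a real puzzle would use RSA-sized primes and a far larger t:

```python
# Toy parameters: p and q are secret to the puzzle maker.
p, q = 999983, 1000003
n, t = p * q, 10**5  # t squarings ~ the enforced delay

# Puzzle maker's shortcut: reduce the exponent mod phi(n).
phi = (p - 1) * (q - 1)
mask = pow(2, pow(2, t, phi), n)  # would be combined with the real key

# Solver without p, q: no known way around t *sequential* squarings,
# so the work cannot be parallelized away.
x = 2
for _ in range(t):
    x = (x * x) % n
assert x == mask
```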
03:39 🔗 pikhq Ctrl-S: 3ROT-13, please.
04:31 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
04:38 🔗 SketchCow Boop
04:38 🔗 bzc6p has quit (Ping timeout: 600 seconds)
04:40 🔗 SketchCow Hi. So.
04:40 🔗 SketchCow 1. I really don't want to encrypt.
04:43 🔗 SketchCow 2. I am comfortable with, and happy with, git-annex's level of complexity and self-healing.
04:43 🔗 SketchCow 3. There comes a point in the project when bad actors have to just be tolerated.
04:44 🔗 SketchCow 4. There comes a point when you have to assume the bad actors making Sybil attacks against shards are not going to be able to touch the originals, which have torrents
04:44 🔗 SketchCow I think that we should move to a field test with closure and a selection of items.
04:45 🔗 SketchCow or collections, really.
04:45 🔗 aschmitz Works for me.
04:45 🔗 SketchCow I think perhaps an AWS system is the way to go.
04:45 🔗 SketchCow Repeatable, we can mess with them
04:45 🔗 SketchCow Use AWS bandwidth
04:45 🔗 SketchCow Unless we want to start with archive.org internally.
04:45 🔗 SketchCow I can get another server
05:08 🔗 DFJustin LOCKSS has been mentioned a couple times, is it feasible to actually just use LOCKSS
05:10 🔗 aschmitz My impression is that LOCKSS is basically just a caching proxy. I could be wrong, but if that's all it is, probably not.
05:15 🔗 aschmitz Apparently I'm somewhat wrong. You might be able to produce LOCKSS manifests for IA files, I guess, which might work.
05:15 🔗 aschmitz Slightly more information and useful links at http://www.lockss.org/about/how-it-works/
06:39 🔗 db48x` has quit (Read error: Operation timed out)
06:49 🔗 SketchCow no.
06:52 🔗 godane something unrelated
06:55 🔗 godane SketchCow: i posted on -bs
10:37 🔗 bzc6p_ is now known as bzc6p
13:32 🔗 closure SketchCow: if AWS is used, this would mean pumping the whole IA contents into AWS and back out eventually. that's some BW cost
13:33 🔗 closure some VM like AWS is probably ok for initial development
13:35 🔗 Kenshin sketch: i can kinda provide resources u know
13:56 🔗 SketchCow Kenshin: Appreciated. Yes, I forgot, the bandwidth
13:57 🔗 Kenshin there was the other interesting topic in #archiveteam as well, about .onion site. heh
14:12 🔗 SketchCow I saw.
14:22 🔗 trs80 has quit (Ping timeout: 186 seconds)
14:41 🔗 SketchCow Kenshin, how much can you throw somewhere near the US in disk space for this test backup?
14:44 🔗 Kenshin u'd probably prefer LAX, i have a 10TB node there
14:44 🔗 Kenshin it's 10ms from archive.org
15:05 🔗 SketchCow Yes.
15:05 🔗 SketchCow Well, for this test, assign 500gb to it initially.
15:05 🔗 SketchCow I want to see it overflow, hit issues, etc
15:06 🔗 SketchCow Otherwise, we're testing a butterfly against a tanker
15:22 🔗 Kenshin k. i'll arrange something for you guys while you carry on hashing it out
15:27 🔗 Start has quit (Disconnected.)
15:29 🔗 Ctrl-S cut the machine's power halfway through
16:02 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
16:51 🔗 Start has quit (Disconnected.)
16:58 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
17:21 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
17:21 🔗 SketchCow I'll be making another machine with 500gb. If people have 500gb networked drives, that would help
17:21 🔗 SketchCow Probably 5-10 would be a good number.
17:22 🔗 SketchCow As mentioned by closure, git and git-annex need to be on there. Maybe we need a wiki page with requirements.
17:26 🔗 bzc6p has quit (Ping timeout: 600 seconds)
17:35 🔗 SketchCow I have to focus on my GDC presentation today, but I like where this is going, a lot. closure, just let us know what technology you need, and if there's code beyond what you would write to make it go.
17:45 🔗 Start has quit (Disconnected.)
18:03 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
18:31 🔗 closure SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:
18:32 🔗 closure - pick a set of around 10 thousand items whose sizes sum to around 8 TB
18:33 🔗 closure - build a map from item to shard. Needs to scale well to 24+ million items. SQL?
18:35 🔗 closure - write an ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time it's run on an (unmodified) item. I know how to make tar and gz reproducible, BTW (see the sketch after this list)
18:36 🔗 closure - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
18:37 🔗 closure - client runtime environment (docker image maybe?) with warrior-like interface
18:37 🔗 closure (all that needs to do is configure things and get git-annex running)
18:38 🔗 closure could someone wiki that? ta
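One plausible shape for the reproducible-tarball step in closure's list, in Python: fix the traversal order and pin every header field that would otherwise vary (mtimes, owners, modes, gzip's own timestamp). The normalizations are assumptions about what suffices; selecting only an item's non-derived files is left to the caller:

```python
import gzip
import os
import tarfile

def ingest_item(item_dir, out_path):
    """Tar and gzip item_dir so the output bytes are identical on every run."""
    def normalize(info):
        info.mtime = 0                 # wall-clock time must not leak in
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mode = 0o644 if info.isfile() else 0o755
        return info

    with open(out_path, "wb") as raw, \
         gzip.GzipFile(fileobj=raw, mode="wb", mtime=0) as gz, \
         tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for root, dirs, files in os.walk(item_dir):
            dirs.sort()                # deterministic traversal order
            for name in sorted(files):
                path = os.path.join(root, name)
                tar.add(path, arcname=os.path.relpath(path, item_dir),
                        filter=normalize)
```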
18:38 🔗 Start has quit (Disconnected.)
18:41 🔗 closure oh, getting a full item list with sizes and last modification time might be a good start too
18:42 🔗 yipdw closure: captured at http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
19:45 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
19:46 🔗 bzc6p_ is now known as bzc6p
19:46 🔗 Start has quit (Read error: Connection reset by peer)
19:48 🔗 closure oh and if someone can get a count of all files in all items in the IA, that would be very useful information. Seems like an IA admin is best positioned to do that..
20:35 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
21:23 🔗 Start has quit (Disconnected.)
22:31 🔗 sep332 (~sep332@[redacted]) has joined #internetarchive.bak
22:46 🔗 dirt (james@[redacted]) has joined #internetarchive.bak
22:58 🔗 garyrh_ has quit (Quit: Leaving)
23:01 🔗 jbenet (sid17552@[redacted]) has joined #internetarchive.bak
23:01 🔗 jbenet greetings-- saw the post on HN today.
23:02 🔗 jbenet i'm the author of ipfs.io -- i designed IPFS with the archive in mind. (see also end of https://www.youtube.com/watch?v=skMTdSEaCtA).
23:03 🔗 jbenet Our tech is very close to ready. you can read about the tech details here: http://static.benet.ai/t/ipfs.pdf
23:03 🔗 jbenet or watch the old talk here: https://www.youtube.com/watch?v=Fa4pckodM9g -- i will be doing another, updated tech dive into the protocol + details.
23:04 🔗 jbenet you can loosely think of ipfs as git + bittorrent + dht + web.
23:05 🔗 xmc hmmm
23:05 🔗 yipdw huh I didn't know someone posted this on HN
23:05 🔗 xmc my thoughts too
23:06 🔗 chfoo https://news.ycombinator.com/item?id=9147719
23:06 🔗 yipdw cool, nobody writing about how stupid we all are yet
23:06 🔗 yipdw i'll wait a few more hours
23:06 🔗 xmc hahahah
23:07 🔗 chfoo jbenet: feel free to add your solution in the wiki discussion page
23:07 🔗 jbenet i've been trying to get in touch with you about this-- i've been to a friday lunch (virgil griffith brought me) and recently reached out to brewster. i think you'll find that ipfs will very neatly plug into your arch, and does a ton of heavy lifting. it's not perfect yet -- keep in mind there was no code a few months ago -- but today we're at a point of
23:07 🔗 jbenet streaming video reliably and with no noticeable lag-- which is enough perf for replicating the archive.
23:08 🔗 jbenet --and before you use it, we have to put in the `commit` data structure (so you can have proper version control, like git)--
23:08 🔗 jbenet but basically, we're at a point where figuring out your exact constraints-- as they would look with ipfs-- would help us build the thing you need.
23:09 🔗 ivan` yipdw: that would be me... a month ago https://news.ycombinator.com/item?id=8980154
23:09 🔗 closure been meaning to look into ipfs..
23:09 🔗 yipdw ha
23:10 🔗 xmc jbenet: i should point out that archiveteam is not the internet archive, and only one or two people here are associated with them
23:10 🔗 xmc we just have a good working relationship with them
23:10 🔗 jbenet xmc: ah, thank you for pointing that out.
23:10 🔗 xmc :)
23:10 🔗 xmc sure thing
23:10 🔗 jbenet xmc: not hyper clear from looking at a page for 20s
23:10 🔗 jbenet :]
23:10 🔗 xmc no worries
23:11 🔗 xmc it's a common mistake
23:13 🔗 jbenet yeah, the single page http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK doesn't make it clear-- but then again it's a wiki and we should click home.
23:13 🔗 jbenet well
23:13 🔗 jbenet in any case-- now you know about ipfs :) look into it, i'm sure it'll be useful in this endeavor and we're happy to help. (#ipfs on freenode)
23:14 🔗 jbenet xmc: does the archive have an irc channel?
23:14 🔗 xmc not officially
23:14 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
23:14 🔗 xmc there is #internetarchive on this network though
23:15 🔗 xmc it's most of the same people as in here
23:15 🔗 xmc #archiveteam is the main channel for archiveteam
23:15 🔗 xmc surprisingly enough
23:16 🔗 jbenet cool, thanks!
23:18 🔗 chfoo trying to put up the disclaimer, but the wiki is being hammered
23:19 🔗 mntasauri (~motesorri@[redacted]) has joined #internetarchive.bak
23:21 🔗 z0ner (0c118402@[redacted]) has joined #internetarchive.bak
23:23 🔗 z0ner has quit (Client Quit)
23:24 🔗 z0nenet (0c118402@[redacted]) has joined #internetarchive.bak
23:24 🔗 z0nenet has quit (Client Quit)
23:24 🔗 z0ned (webchat@[redacted]) has joined #internetarchive.bak
23:29 🔗 z0ned So, what's the plan!?
23:30 🔗 z0ned has quit (Quit: Page closed)
23:30 🔗 xmc uh
23:38 🔗 chfoo has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
23:41 🔗 yipdw so I threw a bit about IA's data model and browsing tools in http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation#Browsing_the_Internet_Archive
23:41 🔗 yipdw I'm not sure if "ia search 'collection:*'" is a good idea, but it seems to work if you disregard that it might be killing a search server somewhere
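The same enumeration can be done from the internetarchive Python library that backs the `ia` CLI; a sketch that would also feed closure's earlier request for item sizes and file counts, scoped to one collection rather than 'collection:*'. The collection name is a stand-in, and the 'derivative' source filter follows IA's usual file metadata conventions, so treat the details as assumptions:

```python
from itertools import islice
from internetarchive import get_item, search_items

# Walk a single collection instead of hammering search with collection:*;
# islice keeps the sketch polite to the search servers.
for result in islice(search_items("collection:GratefulDead"), 100):
    item = get_item(result["identifier"])
    originals = [f for f in item.files if f.get("source") != "derivative"]
    total_bytes = sum(int(f.get("size", 0)) for f in originals)
    print(item.identifier, len(originals), total_bytes)
```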
23:45 🔗 jbenet is joey from git-annex in here?
23:46 🔗 xmc jbenet: yes, he goes by the name closure
23:46 🔗 jbenet closure: is it you? (guessing from the irc note)
23:46 🔗 jbenet great
23:48 🔗 chfoo zooko was here earlier too
23:51 🔗 jbenet chfoo: lol the post brought all the fs nuts out of the woodwork :)
23:52 🔗 jbenet i'll stick around if you dont mind. i can also leave, whatever.
23:52 🔗 yipdw jbenet: yeah, sticking around is totally cool
23:54 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
23:57 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:57 🔗 svchfoo1 gives channel operator status to Start
23:58 🔗 rossdylan (~rossdylan@[redacted]) has joined #internetarchive.bak
23:59 🔗 ryang (uid10904@[redacted]) has joined #internetarchive.bak
23:59 🔗 mntasauri which fs does zooko work with
23:59 🔗 xmc tahoe-lafs
23:59 🔗 mntasauri tahoe ah
