[01:02] *** garyrh_ gives channel operator status to closure garyrh ivan` Kenshin
[01:03] closure: Good job
[01:04] I'd like the default shard(?) to be 100gb, or a percentage of a typical small drive
[01:04] For the record, I think the amount of stuff will be far under 20 petabytes
[01:04] Our internal guy will have some good data on it this week.
[01:11] So, with this, the question becomes.... what's missing?
[01:11] I mean, a good leaderboard and view, of course.
[01:11] But I have some extra disk space, as I'm sure others do, to donate to the cause.
[01:40] SketchCow: each shard is split further among clients, so it can be larger than a typical small drive. A client could store only a few gb out of an 8 tb shard
[01:41] I se.
[01:41] See.
[01:41] OK, withdrawn.
[01:41] I have a range of questions, if you want them.
[01:41] it's basically 1st come 1st serve which clients get which Items out of a shard
[01:41] yeah, ask away
[01:41] (Also, I'll redo the talk page to reflect pushing to git-annex)
[01:43] Goofy McAnderson has a drive dedicated to us. It's on his system, it's 500gb.
[01:43] If he was to look in that drive's directory, what would he see?
[01:43] bunch of $itemname.tar
[01:44] some random or not so random subset of the IA items
[01:44] Rounded to item?
[01:44] So $itemname.tar is the full originals set of $itemname?
[01:44] yeah, presumably w/o the derives
[01:53] *** BEGIN LOGGING AT Tue Mar 3 20:53:14 2015
[01:53] *** Now talking on #internetarchive.bak
[01:53] *** Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
[01:53] *** Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Sun Mar 1 23:43:37 2015
[01:53] *** svchfoo1 gives channel operator status to chfoo
[01:54] That's a pretty neat approach to git annex for it, yeah.
[01:54] (apologies for not having been around much otherwise: job interviewing. :))
[01:54] kind of a cool effect of distributed, forked git repos
[01:55] or, he could rename it to "awesomhot.tar", and the IA wouldn't care, it can still restore the file from him despite the name change
[01:57] does that capability fall out from the usual way git handles renames?
[01:57] or is there more on top from git-annex
[01:57] That's more a function of git-annex's storage of data.
[01:57] it's basically due to git renames, yes
[01:58] hmm neat
[01:58] All that's in the git repo *itself* is just a symlink to the git-annex data store, but git annex doesn't really look at the symlinks to determine what the file is.
[01:58] Yaaay, nice properties falling out naturally.
[01:58] pikhq is right
[02:00] I like that solution (nooo, hotbootydogporn)
[02:04] These are all good.
[02:04] I am running out of questions
[02:04] Oh, and this can go into the wiki or your document
[02:04] What is IA running?
[02:05] Like, what do we need to be running? Another machine with git, git-annex, or whatever?
[02:05] I mean, it sounds almost like we need to give you a box and let you start making it into god.
[02:06] IA needs some kind of server, with git and git-annex. I'm assuming locked down ssh keys for clients to access it (can only run git-annex-shell to download data and do git pull/push), although that could be handled other ways.
[02:06] How much disk space
[02:06] And is it shoving out all the data?
[02:06] Is it pulling from the items and constructing the love?
[02:07] needs enough disk to buffer outgoing transfers to clients, and probably several gb for the git repos
[02:07] I assume it's doing the client-facing transfer, I don't know about how the $item.tar gets made
[02:09] could be one client per shard too, or something like that, depending on how some of these things scale
[02:09] sorry, 1 server per shard
[02:09] or per 10 or whatever
[02:10] Yeah, might want to aws
[02:10] hmm, here's one other thought.. the total number of files in all items in IA might be only say 10x the number of items. It might make sense to make the repos contain not $item.tar, but $item/$file
[02:11] It's a strong idea.
[02:11] it lets you play mp3 and movies w/o this tar thing that is so hard 35 years after being made ;)
[02:12] (btw, I have a git-annex repo I made a while ago that contains the most popular 500 or so GD live recordings. Kind of amusing.)
[02:13] I pulled out all the recordings of Dark Star. I think I could play them back to back for about 1 week..
[02:14] 119 gb
[02:14] *** closure is only a baby deadhead
[02:18] It might be worth doing for the initial.
[02:18] And it's sexy, it solves the problem of IA completely cratering into the earth
[02:18] the drives just.... have files
[02:18] Nice
[02:19] git-annex.branchable.com/future_proofing
[02:27] Yeah, that's probably my favorite feature of git-annex.
[02:28] If git annex bites the dust somehow, it's just files.
[02:48] updated document with several items
[02:57] will the files be compressed?
[02:58] I was thinking not, but *shrug* could be
[02:59] (assuming it uses ssh they'd be compressed in transit)
[02:59] compression saves a lot on html, which is what most of the web archive would be
[02:59] point
[02:59] someone said something about bzip
[03:00] except, is warc compressed? :)
[03:00] with the amount of data you guys will be working with, you don't really have the option to not use compression
[03:00] warc is commonly gzip compressed.
[03:00] I don't know if the web archive is though.
[03:01] if it's separate files, and not $Item.tar, it could decide on a per-file basis when adding it whether to compress, or leave an already compressed file format as-is
[03:01] (nevermind, the bzip thing was something else)
[03:02] how much low-entropy stuff would there be on IA, anyway? .warc.gz, audio, video are relatively incompressible
[03:02] pdf, html, disk images, ..
[03:03] there are compression algorithms that uncompress .zip or whatever and then recompress it, making a note to re-zip it when you uncompress
[03:03] How expensive (computer time, coder time) would it be to try compressing everything, or check if compression would have a benefit?
[03:04] like if it's a warc, make sure it's a compressed warc, if it's a known video/image/audio file don't bother
[03:04] Coder time, pretty easy. Compute time, :(
[03:05] could you just have a script look to see if compression has been done, and then if it's a known compressible apply compression if needed?
[03:06] you could just try compressing the first 1kB and see if it helps or not
[03:07] My question is mostly how much git-annex would trust the clients. For example, if I claim I have the whole archive, does it have any realistic way of checking? Obviously I might need some metadata for that (hashes of each file, or whatever), but far less than actually having everything.
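A minimal sketch of the "compress the first 1kB and see if it helps" check discussed above, combined with the "skip known video/image/audio formats" shortcut; the extension list, sample size, and threshold are illustrative assumptions, not anything the project settled on:

```python
import zlib

# Extensions we assume are already compressed and not worth re-compressing (illustrative).
ALREADY_COMPRESSED = {'.gz', '.bz2', '.xz', '.zip', '.jpg', '.png', '.mp3', '.mp4', '.mkv'}

def worth_compressing(path, sample_size=1024, ratio_threshold=0.9):
    """Heuristic: compress the first `sample_size` bytes and see if they shrink enough."""
    if any(path.lower().endswith(ext) for ext in ALREADY_COMPRESSED):
        return False
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    compressed = zlib.compress(sample, 6)
    # Only bother if the sample shrinks by more than ~10%.
    return len(compressed) < len(sample) * ratio_threshold
```

The coder-time cost is indeed small; the compute-time cost is bounded because only the first 1kB of each file is ever test-compressed.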
[03:07] you could also run your shard on a compressed filesystem, take the complexity of compression entirely out of this system
[03:08] (See also Sybil attacks on multiple copies, etc.)
[03:09] aschmitz: you ask the client for a salted hash
[03:09] randomly request a 1M chunk?
[03:09] aschmitz: that's a fun attack. :) The fire drill section has one way to detect such bad actors, but it seems hard to know for sure, you have to decide how much you trust people and the system, and hope for enough redundancy ..
[03:09] fenn: Yeah, I had proposed that before, and it seems like the only realistic way I can think of. Not sure if git-annex does that now, but I have to bet that closure could make it. :)
[03:10] i have yet to hear any good solutions to sybil attacks (in general) and proof of storage
[03:10] Salted hash of a random 1M chunk would suffice for detecting corruption, but yeah. That's not really a way of determining how much to trust a client but more whether or not a client has violated that trust.
[03:10] git-annex doesn't have proof right now, other than trying to get the file that it claims to have
[03:11] pikhq: Could do salted hash of the whole N GB chunk, as all you have to transfer is the hash. Would have to read that from the archive too, but whatever disk scrubbing IA does probably has to read everything regularly anyway.
[03:12] however, systems that have a tit-for-tat incentive system need proof more than this system, which has an incentive of helping the IA
[03:12] (right?)
[03:12] fenn: That's a valid point. I suppose if you don't care that everyone is anonymous, you could register different people separately.
[03:13] forcing people to register doesn't solve the sybil attack problem
[03:13] closure: Yeah, there's no incentives to game here which helps a lot in terms of the odds of being attacked.
[03:13] fenn: Depends on how thorough you are at identifying them. Having, say, a number of different universities register is something where you could verify that they're distinct, at least. Individuals would be a lot harder.
[03:13] well, no incentive other than some random 4chan thread "let's kill the IA today because it's a wednesday"
[03:13] but it may not matter anyway; bittorrent has various enemies and is also vulnerable to sybil attacks, but it still works fine
[03:13] closure: That was the concern, yeah.
[03:14] multiple tiers of trust
[03:14] closure: Alternatively, someone could just target a small section of the data (say, furry art or something), claim they had several copies, and if the IA ever does become a crater, nobody else will have bothered to keep copies.
[03:15] how would they "target" the data?
[03:15] I don't necessarily have solutions, and git-annex is awesome, just trying to throw out potential issues. They're not all necessarily valid attacks, or worth defending against.
[03:16] aschmitz: yeah. It is possible to prevent such targeting, but it adds quite a lot of complexity, and possibly decreases incentives for some good actors
[03:16] fenn: Presumably they could "sample" a bunch of different chunks, then identify content they didn't like? WARCs would be pretty easy to identify, say, by domain name by just scanning a few kb of content.
[03:16] ie, it could assign particular items at random to clients, and ignore clients who claim to have unassigned items
[03:16] Sure.
[03:17] or encrypt items..
[03:17] Aside: What happened to http://archive.org/about/bibalex_p_r.php ?
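A rough sketch of the salted-hash spot check being described (salt plus a random 1M chunk): the server derives the expected answer from its own copy, so a client cannot pass by keeping only precomputed hashes. The chunk size, hash choice, and challenge format are assumptions; git-annex does not do this out of the box as of this conversation.

```python
import hashlib
import os
import random

CHUNK_SIZE = 1024 * 1024  # 1 MB chunk, as suggested above

def make_challenge(file_size):
    """Server side: pick a fresh random salt and a random chunk offset within the file."""
    salt = os.urandom(16)
    offset = random.randint(0, max(file_size - CHUNK_SIZE, 0))
    return salt, offset

def answer_challenge(path, salt, offset):
    """Client side: hash the salt followed by the requested chunk and return the digest."""
    h = hashlib.sha256(salt)
    with open(path, 'rb') as f:
        f.seek(offset)
        h.update(f.read(CHUNK_SIZE))
    return h.hexdigest()

# The server computes the same digest from its own copy of the file and compares.
# Because the salt is fresh each time, a client that no longer holds the chunk
# cannot answer, even if it cached earlier digests.
```

As noted above, this detects corruption or outright lying about a specific file; it does not by itself tell you how much to trust a client overall, and reading the chunk server-side can piggyback on whatever disk scrubbing the IA already does.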
[03:18] closure: Unfortunately, encryption is a pretty annoying single point of failure, and if the key(s) go away, all the data does.
[03:18] yeah, that
[03:19] I actually shard (SSS) my gpg key among many git-annex repos. Saved me from losing it last month :)
[03:19] N of M is the bomb
[03:19] Nice.
[03:20] Frankly probably the best thing for preventing these sorts of attacks is just making sure that enough good actors participate that these won't work. :)
[03:20] Yeah, it's not like there's no precedent for doing that with keys (see: DNSSEC root), but I'm guessing we'd prefer to avoid having to deal with it.
[03:21] the most likely attacks are not cryptographic or explosives, but legal actions
[03:21] yeah, if there was a huge asshole contingent I'd guess we'd have seen it in the warrior projects
[03:21] haven't seen that so far
[03:21] like "cease and desist at once!"
[03:21] yipdw: I'm still impressed you haven't.
[03:21] Which, y'know, good.
[03:22] I am impressed too
[03:22] Not very interesting to assholes.
[03:22] yipdw: you forget when we HTML injection exploited the leaderboard? :)
[03:22] It's like attacking an orphan's puppy. Just, why?
[03:22] unfortunately you need orders of magnitude more participation (and thus attention, and unwanted attention) than the warrior projects
[03:22] closure: oh yeah, there was that
[03:23] fenn: True.
[03:23] <-- hey there's always 1 asshole
[03:23] My warrior instance I haven't paid attention to in months.
[03:23] Still see it show up on leaderboards though.
[03:23] pikhq: Might it be worth allowing some sort of automatic "I have these chunks" messages/something that can be shared among places that trust one another? I suspect many places that support LOCKSS would potentially dedicate some storage space, and be willing to trust one another and avoid duplicating effort unnecessarily.
[03:23] "trust, but verify"
[03:24] if there is a simple protocol to verify then there's no reason to blindly trust
[03:24] Well, sure.
[03:25] if N of M shards are needed to reassemble a decryption key, who owns the key, and how do they get it?
[03:26] Whoever can get N shards :)
[03:26] isn't this just DRM all over again?
[03:26] (wasn't it proven that DRM can't work?)
[03:26] Under such a scheme, you wouldn't actually let people decrypt the data unless the key were revealed (which would only happen when IA disappears), which technically works.
[03:27] how would the key be revealed "when IA disappears" (whatever that means)
[03:27] DRM relies on saying "you can see this data, but you have to stop when I tell you to". This would be "you can have this data, but can't decrypt it until I release the key".
[03:27] Presumably a number of semi-trusted people would be given shards of the key, and N of M of them would have to agree.
[03:28] also this sounds a lot like various video game quests :P
[03:28] Note: I don't particularly like this idea, but I'm explaining how it would work.
[03:28] Kill Bills 0..M-N
[03:28] me neither, for the record
[03:28] Three were intended for the Elves, Seven for Dwarves, Nine for Men, and one, the One Ring was given to 4chan
[03:28] Which is to say: I don't think the crypto is really necessary.
[03:29] On the other hand, Freenet seems to avoid some problems by not really letting anyone see what their computer is actually storing. Hopefully that wouldn't be an issue here, but I don't know how many threats IA gets, or how many individuals would be likely to get.
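Since N-of-M key splitting keeps coming up (closure shards his gpg key this way), here is a toy sketch of Shamir secret sharing over a prime field. It is purely illustrative: a real deployment would use a vetted tool (ssss, gfshare, or similar) and a cryptographically secure random source rather than Python's random module.

```python
import random  # a real implementation would use the secrets module instead

PRIME = 2 ** 127 - 1  # a Mersenne prime; the secret must be an integer smaller than this

def split(secret, n_shares, threshold):
    """Split an integer secret into n_shares points; any `threshold` of them recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):      # evaluate the polynomial with Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover(shares):
    """Lagrange interpolation at x = 0 recovers the secret from `threshold` shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Example: split a key into 7 shares held by semi-trusted people; any 4 can reassemble it.
shares = split(123456789, 7, 4)
assert recover(random.sample(shares, 4)) == 123456789
```

Fewer than the threshold number of shares reveal nothing about the secret, which is exactly the "N of M semi-trusted people have to agree" property described above.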
[03:29] if you encrypt it you create a single point of failure
[03:30] if someone controls the keys they can control the whole array
[03:30] Ctrl-S: Not that I disagree, but we were at least discussing how to make it an N of M point of failure :)
[03:30] And to be fair, the keys would only be used to obscure the data that was being stored, not for commanding the clients or anything.
[03:31] i was thinking a different failure mode... the world blows up and nobody can read the ancient scrolls because they're encrypted
[03:31] Yeah, that would also suck.
[03:31] i was thinking of access to data, not C&C
[03:32] if the keys are lost, the data is lost
[03:32] so if you did have keys you'd need to spread them over the world
[03:32] Ah, I was confused by your "can control the whole array" comment. Anyway, it doesn't seem like anyone likes the idea, so it doesn't seem worth going over too much.
[03:32] i'm sure this conversation will come up again and again, with all the "dark" data in IA
[03:32] if you are the only one with the keys, no one can access it without you
[03:32] Actually, yeah, dark data might be interesting.
[03:33] only encrypt dark data?
[03:34] Ctrl-S: DNSSEC handled the "one key spread over the world" with their Trusted Community Representatives stuff: http://www.root-dnssec.org/index.html%3Fp=171.html . Ignoring everything else about DNSSEC, it seems like a reasonable proposal if you have to do that sort of thing.
[03:34] there's something called "time lock encryption puzzles" where you basically just square a number repeatedly, and it has to be done in serial fashion, and it takes a lot of processor cycles, but not an unfeasible number of cycles
[03:35] the idea is that someone can encrypt the data after crunching on it for an arbitrarily long time
[03:35] *** Ctrl-S proposes the well established and highly secure ROT-13 crypto algorithm
[03:35] decrypt*
[03:35] fenn: I was actually looking into that for similar data, yeah. Unfortunately, you kind of have to leave something running doing the calculations to have the time lock expire at the right time, but I guess that's not a huge deal.
[03:39] Ctrl-S: 3ROT-13, please.
[04:31] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[04:38] Boop
[04:38] *** bzc6p has quit (Ping timeout: 600 seconds)
[04:40] Hi. So.
[04:40] 1. I really don't want to encrypt.
[04:43] 2. I am comfortable with, and happy with, git-annex's level of complexity and self-healing.
[04:43] 3. There comes a point in the project when bad actors have to just be tolerated.
[04:44] 4. There comes a point when you have to assume the bad actors making sybil attacks against shards are not going to be able to touch the originals, which have torrents
[04:44] I think that we should move to a field test with closure and a selection of items.
[04:45] or collections, really.
[04:45] Works for me.
[04:45] I think perhaps an AWS system is the way to go.
[04:45] Repeatable, we can mess with them
[04:45] Use AWS bandwidth
[04:45] Unless we want to start with archive.org internally.
[04:45] I can get another server
[05:08] LOCKSS has been mentioned a couple times, is it feasible to actually just use LOCKSS
[05:10] My impression is that LOCKSS is basically just a caching proxy. I could be wrong, but if it is, probably not.
[05:15] Apparently I'm somewhat wrong. You might be able to produce LOCKSS manifests for IA files, I guess, which might work.
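For reference, a toy version of the time-lock puzzle fenn describes at 03:34 (the Rivest/Shamir/Wagner construction): whoever sets the puzzle and knows the factors of n can compute the unlock value instantly, while everyone else has to do the t squarings strictly one after another. The numbers below are tiny and purely illustrative.

```python
# Toy time-lock puzzle: unlocking requires t sequential squarings mod n.
p, q = 10007, 10009              # tiny primes for illustration; real ones would be huge
n = p * q
phi = (p - 1) * (q - 1)
t = 1_000_000                    # number of sequential squarings the solver must perform
x = 2

# Puzzle setter's shortcut: knowing phi(n), reduce the exponent 2**t modulo phi(n).
fast = pow(x, pow(2, t, phi), n)

# Solver without the factors: t squarings in a row, which cannot be parallelized.
slow = x
for _ in range(t):
    slow = (slow * slow) % n

assert fast == slow              # both sides arrive at the same unlock value
```

The unlock value would then be used as (or to encrypt) the real key, which matches the caveat above: somebody has to actually keep grinding through the squarings for the lock to open on schedule.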
[05:15] Slightly more information and useful links at http://www.lockss.org/about/how-it-works/
[06:39] *** db48x` has quit (Read error: Operation timed out)
[06:49] no.
[06:52] something unrelated
[06:55] SketchCow: i posted on -bs
[10:37] *** bzc6p_ is now known as bzc6p
[13:32] SketchCow: if aws is used, this would mean pumping the whole IA contents into aws and back out eventually. that's some BW cost
[13:33] some vm like aws is probably ok for initial development
[13:35] sketch: i can kinda provide resources u know
[13:56] Kenshin: Appreciated. Yes, I forgot, the bandwidth
[13:57] there was the other interesting topic in #archiveteam as well, about .onion site. heh
[14:12] I saw.
[14:22] *** trs80 has quit (Ping timeout: 186 seconds)
[14:41] Kenshin, how much can you throw somewhere near the US in disk space for this test backup?
[14:44] u'd probably prefer LAX, i have a 10TB node there
[14:44] it's 10ms from archive.org
[15:05] Yes.
[15:05] Well, for this test, assign 500gb to it initially.
[15:05] I want to see it overflow, hit issues, etc
[15:06] Otherwise, we're testing a butterfly against a tanker
[15:22] k. i'll arrange something for you guys while you carry on hashing it out
[15:27] *** Start has quit (Disconnected.)
[15:29] cut the machine's power halfway through
[16:02] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[16:51] *** Start has quit (Disconnected.)
[16:58] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[17:21] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[17:21] I'll be making another machine with 500gb. If people have 500gb networked drives, that would help
[17:21] Probably 5-10 would be a good number.
[17:22] As mentioned by closure, git and git-annex to be on there. Maybe we need a wiki page with requirements.
[17:26] *** bzc6p has quit (Ping timeout: 600 seconds)
[17:35] I have to focus on my GDC presentation today, but I like where this is going, a lot. closure, just let us know what technology you need, and if there's code beyond what you would write to make it go.
[17:45] *** Start has quit (Disconnected.)
[18:03] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:31] SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:
[18:32] - pick a set of around 10 thousand items whose size sums to around 8 TB
[18:33] - build map from Item to shard. Needs to scale well to 24+ million. sql?
[18:35] - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW
[18:36] - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
[18:37] - client runtime environment (docker image maybe?) with warrior-like interface
[18:37] (all that needs to do is configure things and get git-annex running)
[18:38] could someone wiki that? ta
[18:38] *** Start has quit (Disconnected.)
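On the ingestion step in closure's list above: one plausible way to make the per-item tarball reproduce the same checksum on every run (closure says he knows how to do this; this particular recipe is only a guess at it) is to fix the traversal order and zero out timestamps, ownership, and the gzip header fields that normally vary. It assumes the item directory already contains only the non-derived files.

```python
import gzip
import os
import tarfile

def normalize(tarinfo):
    """Strip tar metadata that varies between runs or machines."""
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = ''
    tarinfo.mtime = 0
    return tarinfo

def reproducible_tar_gz(item_dir, out_path):
    """Tar up item_dir so the same (unmodified) input always yields the same checksum."""
    with open(out_path, 'wb') as raw:
        # Passing fileobj and mtime=0 keeps the output file name and the current time
        # out of the gzip header, which would otherwise change the checksum every run.
        with gzip.GzipFile(fileobj=raw, mode='wb', mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode='w') as tar:
                for root, dirs, files in os.walk(item_dir):
                    dirs.sort()                      # deterministic directory traversal
                    for name in sorted(files):       # deterministic file order
                        path = os.path.join(root, name)
                        tar.add(path, arcname=os.path.relpath(path, item_dir),
                                filter=normalize)
```

Run twice on an unmodified item, the two outputs should be byte-identical, so the git-annex key computed from the tarball stays stable.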
[18:41] oh, getting a full item list with sizes and last modification time might be a good start too
[18:42] closure: captured at http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
[19:45] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[19:46] *** bzc6p_ is now known as bzc6p
[19:46] *** Start has quit (Read error: Connection reset by peer)
[19:48] oh and if someone can get a count of all files in all items in the IA, that would be very useful information. Seems like an IA admin is best positioned to do that..
[20:35] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[21:23] *** Start has quit (Disconnected.)
[22:31] *** sep332 (~sep332@[redacted]) has joined #internetarchive.bak
[22:46] *** dirt (james@[redacted]) has joined #internetarchive.bak
[22:58] *** garyrh_ has quit (Quit: Leaving)
[23:01] *** jbenet (sid17552@[redacted]) has joined #internetarchive.bak
[23:01] greetings-- saw the post on HN today.
[23:02] i'm the author of ipfs.io -- i designed IPFS with the archive in mind. (see also end of https://www.youtube.com/watch?v=skMTdSEaCtA).
[23:03] Our tech is very close to ready. you can read about the tech details here: http://static.benet.ai/t/ipfs.pdf
[23:03] or watch the old talk here: https://www.youtube.com/watch?v=Fa4pckodM9g -- i will be doing another, updated tech dive into the protocol + details.
[23:04] you can loosely think of ipfs as git + bittorrent + dht + web.
[23:05] hmmm
[23:05] huh I didn't know someone posted this on HN
[23:05] my thoughts too
[23:06] https://news.ycombinator.com/item?id=9147719
[23:06] cool, nobody writing about how stupid we all are yet
[23:06] i'll wait a few more hours
[23:06] hahahah
[23:07] jbenet: feel free to add your solution in the wiki discussion page
[23:07] i've been trying to get in touch with you about this-- i've been to a friday lunch (virgil griffith brought me) and recently reached out to brewster. i think you'll find that ipfs will very neatly plug into your arch, and does a ton of heavy lifting. it's not perfect yet -- keep in mind there was no code a few months ago -- but today we're at a point of
[23:07] streaming video reliably and with no noticeable lag-- which is enough perf for replicating the archive.
[23:08] --and before you use it, we've to put in the `commit` datastructure (so you can have proper version control like git--
[23:08] but basically, we're at a point where figuring out your exact constraints-- as they would look with ipfs-- would help us build the thing you need.
[23:09] yipdw: that would be me... a month ago https://news.ycombinator.com/item?id=8980154
[23:09] been meaning to look into ipfs..
[23:09] ha
[23:10] jbenet: i should point out that archiveteam is not the internet archive, and only one or two people here are associated with them
[23:10] we just have a good working relationship with them
[23:10] xmc: ah, thank you for pointing that out.
[23:10] :)
[23:10] sure thing
[23:10] xmc: not hyper clear from looking at a page for 20s
[23:10] :]
[23:10] no worries
[23:11] it's a common mistake
[23:13] yeah, the single page http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK doesn't make it clear-- but then again it's a wiki and we should click home.
[23:13] well
[23:13] in any case-- now you know about ipfs :) look into it, i'm sure it'll be useful in this endeavor and we're happy to help. (#ipfs on freenode)
[23:14] xmc: does the archive have an irc channel?
[23:14] not officially
[23:14] *** X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
[23:14] there is #internetarchive on this network though
[23:15] it's most of the same people as in here
[23:15] #archiveteam is the main channel for archiveteam
[23:15] surprisingly enough
[23:16] cool, thanks!
[23:18] trying to put up the disclaimer but the wiki is being hammered
[23:19] *** mntasauri (~motesorri@[redacted]) has joined #internetarchive.bak
[23:21] *** z0ner (0c118402@[redacted]) has joined #internetarchive.bak
[23:23] *** z0ner has quit (Client Quit)
[23:24] *** z0nenet (0c118402@[redacted]) has joined #internetarchive.bak
[23:24] *** z0nenet has quit (Client Quit)
[23:24] *** z0ned (webchat@[redacted]) has joined #internetarchive.bak
[23:29] So, what's the plan!?
[23:30] *** z0ned has quit (Quit: Page closed)
[23:30] uh
[23:38] *** chfoo has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
[23:41] so I threw a bit about IA's data model and browsing tools in http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation#Browsing_the_Internet_Archive
[23:41] I'm not sure if 'ia search 'collection:*'' is a good idea, but it seems to work if you disregard that it might be killing a search server somewhere
[23:45] is joey from git-annex in here?
[23:46] jbenet: yes, he goes by the name closure
[23:46] closure: is it you? (guessing from the irc note)
[23:46] great
[23:48] zooko was here earlier too
[23:51] chfoo: lol the post brought all the fs nuts out of the woodwork :)
[23:52] i'll stick around if you don't mind. i can also leave, whatever.
[23:52] jbenet: yeah, sticking around is totally cool
[23:54] *** GauntletW (~ted@[redacted]) has joined #internetarchive.bak
[23:57] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[23:57] *** svchfoo1 gives channel operator status to Start
[23:58] *** rossdylan (~rossdylan@[redacted]) has joined #internetarchive.bak
[23:59] *** ryang (uid10904@[redacted]) has joined #internetarchive.bak
[23:59] which fs does zooko work with
[23:59] tahoe-lafs
[23:59] tahoe ah
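Tying back to the requests at 18:41 and 19:48 (a full item list with sizes, and a count of files per item), here is a sketch using the internetarchive Python library behind the `ia` tool mentioned at 23:41. The example collection, and filtering on 'original' (non-derived) files, are assumptions about how you would scope it; sweeping the whole archive with 'collection:*' carries the same search-server caveat yipdw gives above.

```python
from internetarchive import get_item, search_items

def summarize_collection(query='collection:prelinger'):
    """Print identifier, total size of original (non-derived) files, and file count per item."""
    for result in search_items(query):
        item = get_item(result['identifier'])
        originals = [f for f in item.files if f.get('source') == 'original']
        total_bytes = sum(int(f.get('size', 0)) for f in originals)
        print(item.identifier, total_bytes, len(originals))

if __name__ == '__main__':
    summarize_collection()
```

For the full 24+ million items this is far too slow done one HTTP request at a time, which is why an IA admin with direct access to the metadata is better positioned to produce the complete list.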