[00:06] right, everything works now, got the three on the wiki going. It's about 1 item per second, so when I wake up in 6 hours they should be done; getting to the movies one after that
[00:09] the script can be grabbed here: https://gist.github.com/f77f094032110a7b51e7.git
[00:10] no wait, let me change the name of that first
[00:11] done
[00:11] just run with `python ia-colletion-size.py`
[00:12] I'm doing ephemera, computermagazines and softwarelibrary
[00:43] *** bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
[00:45] *** zooko (~user@[redacted]) has joined #internetarchive.bak
[00:45] Hi folks! I'm here because fenn told me about http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK and I'm one of the authors of Tahoe-LAFS.
[00:47] *** garyrh (garyrh@[redacted]) has joined #internetarchive.bak
[00:49] *** bzc6p_ has quit (Ping timeout: 600 seconds)
[00:52] *** S[h]O[r]T (omgitsme@[redacted]) has joined #internetarchive.bak
[00:56] *** Rotab has quit (hub.se irc.du.se)
[01:17] hi zooko
[01:22] twph... I suspect faster methods exist but we will stick with this for now
[01:22] ia mine does parallel threads
[01:25] the wiki apparently gives me a PHP warning when I edit the talk page?
[01:25] Warning: file_get_contents(/home/archivet/public_html/extensions/SpamBlacklist/wikimedia_blacklist) [function.file-get-contents]: failed to open stream: No such file or directory in /home/archivet/public_html/extensions/SpamBlacklist/SpamBlacklist_body.php on line 123
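(For context, a minimal sketch of what a collection-size script like the one above might look like, assuming the `internetarchive` Python library and that `item_size` is exposed as a search field; the actual gist may differ. Summing sizes straight out of the search results also sidesteps the ~1 item/second pace of fetching each item's metadata individually:)

```python
# Sketch of a collection-size counter like the one discussed above.
# Assumes the `internetarchive` library (pip install internetarchive)
# and that `item_size` is available as a search result field.
import sys

from internetarchive import search_items

def collection_size(collection):
    """Sum item_size over every item in an IA collection."""
    total_bytes = 0
    count = 0
    for result in search_items(f'collection:{collection}',
                               fields=['identifier', 'item_size']):
        total_bytes += result.get('item_size') or 0
        count += 1
    return count, total_bytes

if __name__ == '__main__':
    names = sys.argv[1:] or ['ephemera', 'computermagazines', 'softwarelibrary']
    for name in names:
        items, size = collection_size(name)
        print(f'{name}: {items} items, {size / 1e12:.2f} TB')
```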
[02:16] *** zooko` (~user@[redacted]) has joined #internetarchive.bak
[02:26] *** furry5 (~furry5@[redacted]) has joined #internetarchive.bak
[02:26] Hi guys
[02:27] I need help, and I dunno if this is the right spot to ask
[02:27] *** zooko has quit (Read error: Operation timed out)
[02:31] *** furry5 has quit (Quit: furry5)
[02:31] *** Kacey25 (~QbbMLqNb@[redacted]) has joined #internetarchive.bak
[02:32] *** Kacey25 has quit (Read error: Connection reset by peer)
[02:32] *** chfoo sets mode +s #internetarchive.bak
[02:41] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[02:41] heh, is that the same zooko behind Zooko's Triangle?
[03:06] Yes.
[03:06] *** zooko` is now known as zooko
[03:09] http://i.imgur.com/EyW7Krb.gif
[03:19] *** garyrh has quit (Remote host closed the connection)
[03:23] that gif should go on the wiki page
[03:25] We'd be the hippest dudes in the playground with that radical flaming text
[03:42] *** garyrh (garyrh@[redacted]) has joined #internetarchive.bak
[03:46] to protect against bad actors, including the most recent repo of hashes with each block sent out would mean that it'd be very obvious if corruption or bad actors changed a block
[03:47] even if they had half the nodes with the bad version, it'd be blatantly obvious that more than one "correct" version was out there, and thus someone's playing silly buggers
[03:48] I think the biggest problem this will have, assuming you even reach tens of thousands of people with spare hard drives, is people dropping off the network as they lose interest in just weeks/months
[03:48] yes
[03:48] that
[03:49] try collaborating with libraries, particularly national/university ones?
[04:17] https://www.kickstarter.com/projects/1753332742/back-up-the-internet?ref=category_popular
[04:22] we can go one step further and use Blu-ray discs like Facebook :P
[04:31] What about using microfilm?
[04:32] or ZIP disks
[04:43] *** zooko` (~user@[redacted]) has joined #internetarchive.bak
[04:45] *** zooko has quit (Read error: Operation timed out)
[05:09] *** lhobas has quit (hub.dk efnet.port80.se)
[05:09] *** pikhq has quit (hub.dk irc.homelien.no)
[05:09] *** yipdw has quit (hub.dk irc.homelien.no)
[05:17] *** Kazzy has quit (hub.efnet.us hub.dk)
[05:17] *** Kenshin has quit (hub.efnet.us hub.dk)
[05:17] *** SketchCow has quit (hub.efnet.us hub.dk)
[05:17] *** Void_ has quit (hub.efnet.us hub.dk)
[05:17] *** svchfoo2 has quit (hub.efnet.us hub.dk)
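(A sketch of the hash-manifest idea from 03:46 above: ship the current manifest of block hashes alongside each block, so a corrupted or tampered block, or a second "correct" version circulating, is immediately visible. The manifest format and block IDs here are illustrative:)

```python
# Minimal sketch: verify received blocks against a known manifest of
# SHA-256 hashes. Any block that does not match its manifest entry is
# flagged, and two differing "correct" versions of the same block
# become immediately visible.
import hashlib

def verify_block(block_id: str, data: bytes, manifest: dict) -> bool:
    """Check one block's bytes against the manifest's recorded digest."""
    expected = manifest.get(block_id)
    actual = hashlib.sha256(data).hexdigest()
    return expected == actual

# Example: a tiny manifest and one tampered block.
manifest = {'block-0001': hashlib.sha256(b'original bytes').hexdigest()}
print(verify_block('block-0001', b'original bytes', manifest))  # True
print(verify_block('block-0001', b'tampered bytes', manifest))  # False
```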
[05:42] or just bittorrent
[05:46] This is basically the expected use case for Tahoe-LAFS: not everything has to fit in one place at the same time, storage devices aren't trusted, and it's intended to store a lot of data.
[05:56] except that a volunteer can't just look at their disk and see useful data
[05:57] I suppose if that's a goal, maybe not? (Although there may be a way to do that with LAFS, I have only relatively limited experience with it.)
[05:58] Not sure how useful that would be, though. If people only mirrored their favorite parts of IA, they could do that already, and there probably would be a lot of redundancy in some areas, and no backups of others.
[05:59] If you're just mirroring pseudorandom algorithm-assigned chunks, other than a minor curiosity, most users won't get much out of browsing their own local copy of a few [hundred] gigs.
[05:59] yea
[06:00] the "chunks" have to be somewhat meaningful; it can't split items up into unusable bits
[06:00] I think bittorrent is the way to go
[06:00] we already have a torrent for each item, and it solves the tampering problem
[06:01] you can read everything, but if you mess with it your pieces no longer verify
[06:01] could distribute a custom bittorrent client that automates the process of deciding which users join which swarms
[06:02] Your pieces might not verify then, but unless you have some process randomly requesting chunks from clients on a continuous basis, you can't prove they haven't just stored the hashes and not the data.
[06:02] (Which would be a jerky thing to do, but if we're talking about bad actors...)
[06:03] naturally
[06:03] the same is true of any other solution
[06:03] Well.
[06:04] If you were willing to write your own software, you could have the clients take a challenge, HMAC their stored chunk with that challenge, and return the hash. At least then you can prove that they had the data if you have it yourself, and you don't have to transfer as much data.
[06:05] I'm not aware of any software set up for storing data remotely that does that already, but that doesn't mean it doesn't exist, I suppose. Certainly wouldn't be all that hard to write, either.
[06:05] hmm
[06:05] that would probably work
[06:05] peers could even do the verification themselves
[06:06] True, although ferreting out bad peers in that system might almost be more work than it's worth; assuming that IA scrubs their datastores on a semi-regular basis, they'd be doing the reads anyway.
[06:06] yea
[06:07] Anyway, I have to sleep. Good luck, everyone.
[06:08] (And to chfoo's comment, I have a vague project to automate BD-Rs as bulk storage/medium-term archival media. Not sure how that'll go.)
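(A sketch of the challenge/HMAC scheme proposed at 06:04: the server sends a fresh random challenge, the client HMACs its stored chunk with it, and the server verifies against its own copy, so proving possession costs one hash transfer instead of re-sending the chunk. Function names are illustrative:)

```python
# Challenge-response proof of storage, as discussed above: the client
# proves it still holds a chunk by HMACing it with a fresh random
# challenge; the server recomputes the HMAC from its own copy.
import hashlib
import hmac
import os

def make_challenge() -> bytes:
    """Server side: a fresh random nonce, so responses can't be replayed."""
    return os.urandom(32)

def respond(challenge: bytes, chunk: bytes) -> bytes:
    """Client side: HMAC the stored chunk with the challenge as the key."""
    return hmac.new(challenge, chunk, hashlib.sha256).digest()

def verify(challenge: bytes, chunk: bytes, response: bytes) -> bool:
    """Server side: recompute from the authoritative copy and compare."""
    expected = hmac.new(challenge, chunk, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

chunk = b'some archived bytes'
c = make_challenge()
print(verify(c, chunk, respond(c, chunk)))           # True: client holds the data
print(verify(c, chunk, respond(c, b'hashes only')))  # False: data not held
```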
[06:18] *** zooko` has quit (Remote host closed the connection)
[06:34] *** bzc6p__ has quit (bzc6p__)
[07:15] *** Kazzy (~Kaz@[redacted]) has joined #internetarchive.bak
[07:15] *** Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
[07:15] *** Rotab (~Rotab@[redacted]) has joined #internetarchive.bak
[07:15] *** SketchCow (~jscott@[redacted]) has joined #internetarchive.bak
[07:15] *** Void_ (~Void@[redacted]) has joined #internetarchive.bak
[07:15] *** hub.se gives channel operator status to SketchCow
[07:15] *** hub.se gives channel operator status to pikhq yipdw Kazzy svchfoo2
[07:15] *** lhobas (sid41114@[redacted]) has joined #internetarchive.bak
[07:15] *** pikhq (~pikhq@[redacted]) has joined #internetarchive.bak
[07:15] *** svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
[07:15] *** yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
[09:37] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
[11:54] *** bzc6p has quit (bzc6p)
[12:23] Back in SF.
[12:25] Thanks for the information so far, tephra_ - I've gone ahead and made it into a table. I suspect we can improve that script to run much faster, but for now, we're getting the information I'm hoping for.
[12:25] For example, the Ephemera films collection is very manageable, 10 TB
[12:33] yes, no doubt the script can be made better :) almost done with the software library; had to do a restart on that for some reason. Also started with a big collection, newsandpublicaffairs, which has 103,819 items
[12:40] Just add a few more to the table.
[12:40] This is very helpful stuff.
[12:47] So, several things, as I've been thinking about this on the truck.
[12:47] - It needs registration. It NEEDS it. Tie it to the archive.org library card system, so people can be reached.
[12:48] - Initially, work with file directories connected to the net directly (no cold storage)
[12:48] - Have this system so that if chunks are lost, they drop out
[12:48] - Lower latency for this, therefore - checking once a month.
[12:49] Initially, we want to talk to people who have filesystems that are idle and have a lot of space on them, so they might have a few TB around, and this is filled with chunks.
[12:49] And they just delete chunks when they need the space back.
[12:49] This will work with a LOT of idle filesystems that don't do much.
[12:50] Over time, we can expand it out to cold storage and other such items, which will make life more difficult but make more copies.
[12:51] So, I'm using a different method than you, tephra, but...
[12:52] Items that are simply mediatype "movies" (and this includes dark ones, just because I didn't run the query right) come to 893,057 items and a total of 741 terabytes.
[12:54] hmmm, according to IA there should be 1,888,887 items with mediatype movies
[12:55] Well, don't fret, I'm using a different method than you that queries the database differently.
[12:55] ah ok
[12:56] In fact, it's 512,308 items according to this (not dark)
[12:56] Computermagazines had the same issue - my set was MUCH smaller than yours.
[12:57] Jake, who knows all things collection-searching, will be helpful in understanding what's going on.
[12:58] the script now just does a search for "collection:" using the IA API, which seems to be exactly the same as using the search on the site
[12:59] so it should be 13,065 items in the computermagazines collection, but yeah, it's easy to rerun if needed and bugs are found
[13:03] I am going to go get a little more sleep
[13:03] But I am pleased because the problems are being nailed down VERY quickly.
[13:03] I can provide some good space for testing, as I hope can others.
[13:06] Just checked, movies goes down to 574 TB in my method (no darks)
[13:09] how many items?
[13:13] SketchCow: I think you're on to something with this being an unused-space filler. Need space for something else? IA not currently on fire? Delete at will.
[13:14] or even better, if it auto-deleted to keep X% of a drive available
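(A sketch of the 13:14 idea of auto-deleting to keep X% of a drive free, using nothing beyond the standard library; the chunk directory, the 10% target, and the oldest-first policy are all illustrative choices:)

```python
# Sketch of the "auto-delete to keep X% of a drive available" idea:
# when free space drops below the target fraction, drop stored chunks
# (oldest first here, though any policy would do) until it recovers.
import os
import shutil

def prune_chunks(chunk_dir: str, min_free_fraction: float = 0.10) -> None:
    """Delete chunks until the filesystem has the target free space."""
    usage = shutil.disk_usage(chunk_dir)
    if usage.free / usage.total >= min_free_fraction:
        return
    chunks = sorted(
        (os.path.join(chunk_dir, f) for f in os.listdir(chunk_dir)),
        key=os.path.getmtime,  # oldest chunks first
    )
    for path in chunks:
        os.remove(path)
        usage = shutil.disk_usage(chunk_dir)
        if usage.free / usage.total >= min_free_fraction:
            break

# Hypothetical usage: keep at least 10% of the drive free.
# prune_chunks('/srv/ia-bak/chunks', min_free_fraction=0.10)
```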
[13:25] can someone find out the total number of items in IA?
[15:15] We're going to do a census.
[15:15] Bear in mind it was never quite designed that way, so it'll be "fun"
[15:15] But I got news for you, it will be well over a million.
[15:15] what is an "item" anyway?
[15:16] a logical “thing” that we present on one web page
[15:17] there are lots of items that are like, "collection number 12345 of 67890" and are themselves collections of documents
[15:18] fenn: http://blog.archive.org/2011/03/31/how-archive-org-items-are-structured/
[15:18] yes, I am on that page
[15:19] would your census count things in the wayback machine? I think the wayback is probably what most people think of when they hear "internet archive"
[15:19] *** Start has quit (Disconnected.)
[15:20] also, letting people pick what they want to store is stupid; they could just download it themselves if they wanted to do that
[15:23] 24,598,934 items.
[15:23] No, people do not pick what they store.
[15:23] This will be accompanied by tools and information on downloading what you want, which we should do anyway.
[15:26] bittorrent and DHT provide a pretty good way of letting people download stuff they are interested in, but also share it in a discoverable and systematic way (no herky-jerky "go to my websites at ..." type sharing)
[15:28] I like "a custom bittorrent client that automates the process of deciding which users join which swarms"
[15:28] but I don't know enough about tahoe-lafs to compare it
[15:28] SketchCow: got very small numbers for the software library but put them up
[15:33] git-annex seems more useful for filesystems that change a lot, where you need to preserve the history of those changes. I don't think archival items will be changing very much, if at all.
[15:45] it might be possible to do both tahoe-lafs and bittorrent/DHT at the same time, since they are operating on the same data
[15:48] tephra_: Those numbers aren't actually small
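(On the 15:28 point, one plausible way a custom client could decide which users join which swarms is rendezvous, or highest-random-weight, hashing: each item lands on a deterministic set of volunteers with no central bookkeeping. This is a sketch of that idea, not any existing client; the volunteer IDs and copy count are illustrative:)

```python
# Rendezvous hashing sketch for assigning item torrents to volunteers:
# each (volunteer, item) pair gets a deterministic score, and an item
# is held by the `copies` volunteers that score highest for it.
import hashlib

def score(volunteer_id: str, item_id: str) -> int:
    h = hashlib.sha256(f'{volunteer_id}/{item_id}'.encode()).digest()
    return int.from_bytes(h[:8], 'big')

def assigned_volunteers(item_id: str, volunteers: list, copies: int = 3) -> list:
    """The `copies` volunteers with the highest scores join this item's swarm."""
    return sorted(volunteers, key=lambda v: score(v, item_id), reverse=True)[:copies]

volunteers = [f'volunteer-{i}' for i in range(10)]
for item in ['ephemera-item-001', 'computermagazine-042']:
    print(item, '->', assigned_volunteers(item, volunteers))
```

(A nice property of this scheme: when one volunteer disappears, only the items that ranked them in their top `copies` need a new home; everyone else's assignments are unchanged.)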
[16:04] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[16:12] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
[16:51] *** Start has quit (Disconnected.)
[17:47] SketchCow: good
[18:02] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[18:26] We've never had to do this before in this way, so this will be an interesting census.
[18:38] *** antomatic (~antomatic@[redacted]) has joined #internetarchive.bak
[18:41] *** Start has quit (Disconnected.)
[20:01] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[20:24] *** aschmitz has quit (Read error: Operation timed out)
[20:28] *** Start has quit (Disconnected.)
[20:31] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[20:56] *** Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
[20:58] "Hey, what about..."
[20:58] it would be REALLY cool if I could *choose* what I want to back up locally
[20:59] like, "boats really interest me, so I want my backups to contain web sites about boats, books about boats, documentaries about boats, etc"
[21:28] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[21:29] *** Start has quit (Read error: Connection reset by peer)
[21:46] Definitely - people are more likely to care about and take care of someone else's data if it means something to them too
[21:46] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[21:47] FYI
[21:47] 24,598,934 items.
[21:47] No, people do not pick what they store.
[21:47] also, letting people pick what they want to store is stupid; they could just download it themselves if they wanted to do that
[21:47] *** antomatic shrugs
[21:48] disallowing choice also makes it easier to get good redundancy
[21:49] it's also one less question you need to ask someone
[21:50] true, and solid reasoning, of course. It just removes a hook that might otherwise really sell the project to people
[21:50] e.g. "You can help mirror IA's Secret UFO Files", etc.
[21:51] If there are enough people who would queue up to store chunks of arbitrary data [and of course, hopefully there will be] then no problem
[21:52] "I'm helping IA store the entire works of Miguel Esperanto" [Like] [Retweet] etc etc etc
[21:53] Specificity (although a nuisance from a management point of view) does also make it more attractive to the masses
[21:56] mass involvement isn't an initial goal (it's easier; also SketchCow wants IA registration to ease contact)
[21:56] Seriously, how many of us really care what we archive when running a Warrior? This could be the same for storing archives – there would also be people who don't really care. And if so, the freedom of choice could be given.
[21:56] additionally, choice isn't required for massive popularity; see e.g. SETI@home
[21:56] run something like http://statusboard.archive.org/ and you have a fun UI
[21:56] er, run something like that locally
[21:57] Mm, I do see the point. I'd disagree that Warrior runners don't care what they work on, though.
[21:57] SETI is interesting _because_ it is doing something you're interested in - e.g. UFOs, aliens, and all that whizz
[21:58] "Run a Warrior and archive indiscriminate things" is one thing
[21:58] "Run a Warrior and help save DeviantArt" is a whole nother thing.
[21:58] Well, it may be only me
[21:58] The second one immediately makes people care, if DA is their thing.
[21:58] or if they recognise its significance
[21:59] But here's a logical approach:
[21:59] say it's a selection-based system. Someone wants to store a topic, but everything's already "out". Then he says: "Yuck, I'm not storing any of the other shit."
[22:00] Now say there is a random handout scheme.
[22:00] Would the same person store 500 GB chunks of random shit?
[22:00] erm... content
[22:01] And there's also the risk that some people might want to help but don't, or *absolutely can not* risk possession of immoral or illegal shit
[22:01] 'No, it's an archive' doesn't cut it if your PC gets taken away by the cops
[22:01] So, for some people it may not matter, for some others it may, and the selection-based one is – in my logic – better for more.
[22:01] So immediately you already can't be non-specific. It has to be "help save the IA, but not all of it"
[22:02] for a first cut it's going to be people who accept that
[22:02] true
[22:02] you're talking about iteration 17 worries
[22:02] this is iteration 0
[22:02] that's fair
[22:02] it needs the firm foundations of early adopters who are up for anthing - agreed
[22:02] *anything
[22:04] Are we so optimistic that we'll reach iteration 1 so quickly?
[22:04] I'd worry also about the first 10 PB.
[22:05] This is a bit like the discussions we had in #huntinggrounds
[22:05] as regards large storage, etc
[22:05] I wasn't there, but I guess it's a *bit* larger an amount.
[22:05] I wonder what the 'core' amount is
[22:05] - i.e. originals, not derived
[22:06] - books and documents (women and children) first,
[22:06] etc
[22:06] is wayback data more or less important than video
[22:06] etc etc etc
[22:06] priorities, etc
[22:06] but obviously "do everything" makes it easier
[22:07] and is a better approach when the resources permit
[22:07] wayback probably would be priority #1 in terms of being irreplaceable
[22:08] (there is at least a chance that other copies of books & documents exist elsewhere; a situation which doesn't apply to wayback)
[22:08] antomatic, I have bad news, which you may also know
[22:09] Wayback is 10 PB itself
[22:09] However, maybe it could be deduped a bit
[22:09] maybe that is how you sell it. "Help back up the Wayback Machine" - that puts a face on the indiscriminate data and is at least something that people would understand and get behind in decent numbers
[22:09] bzc6p: Eek, lots of data. :)
[22:10] Hm. There may be some desync on the wiki
[22:10] Bundle the software with a screensaver that shows random old web pages :)
[22:10] SketchCow mentioned 20 PB of data with derivatives
[22:11] The wiki says 18.5 PB unique data
[22:11] and 50 PB total
[22:13] nm
[22:14] maybe the use of the word "unique" confused me
[22:14] I think derivatives are not unique
[22:14] derivatives can always be recreated
[22:15] assuming continued possession of the original
[22:15] Still, many petabytes.
[22:16] mmm
[22:16] But I think it's already been discussed here or there, so now I shut up.
[22:17] me too :)
[22:19] *** Start has quit (Disconnected.)
[22:23] I wouldn't mind my data being 50% stuff with my chosen keywords, and 50% "Internet Archive's choice" (like, the stuff that needs redundancy right now)
[22:27] important things could be used as filler between stuff the user likes
[22:42] SketchCow: here's a fully worked proposed design for using git-annex for this. Deals with scalability.
[22:42] http://git-annex.branchable.com/design/iabackup/
[22:52] ooh, looks nice
[23:01] sweet
[23:09] Result of thinking while driving and hiking all day today :)
[23:21] *** Start (~Start@[redacted]) has joined #internetarchive.bak
[23:22] *** svchfoo1 gives channel operator status to Start
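(A sketch of the half-and-half allocation floated at 22:23 and 22:27 above: fill half of a volunteer's quota with items matching their keywords, then top up with whatever currently has the fewest copies. The item records, fields, and 50/50 split are all illustrative, not part of any proposed design:)

```python
# Sketch of "50% my keywords, 50% Internet Archive's choice": half the
# volunteer's quota goes to items matching their interests, the rest to
# the items with the fewest known copies (the "filler").
def allocate(items, keywords, quota_bytes):
    """items: list of dicts with 'id', 'size', 'copies', 'title'."""
    chosen, filler = [], []
    for item in items:
        if any(k in item['title'].lower() for k in keywords):
            chosen.append(item)
        else:
            filler.append(item)
    filler.sort(key=lambda i: i['copies'])  # least-replicated filler first

    picked, used = [], 0
    # Chosen items may use up to half the quota; filler fills the rest.
    for pool, limit in [(chosen, quota_bytes // 2), (filler, quota_bytes)]:
        for item in pool:
            if used + item['size'] <= limit:
                picked.append(item)
                used += item['size']
    return picked

items = [
    {'id': 'b1', 'size': 40, 'copies': 5, 'title': 'Documentary About Boats'},
    {'id': 'b2', 'size': 30, 'copies': 1, 'title': 'Boat Maintenance Books'},
    {'id': 'x1', 'size': 50, 'copies': 0, 'title': 'Public Affairs Newsreel'},
    {'id': 'x2', 'size': 30, 'copies': 2, 'title': 'Computer Magazine Scans'},
]
print([i['id'] for i in allocate(items, ['boat'], quota_bytes=100)])
# -> ['b1', 'x1']: one boat item, plus the item with zero copies as filler
```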