#internetarchive.bak 2015-03-03,Tue


Time Nickname Message
00:06 πŸ”— tephra_ right, everything works now, got the three on the wiki going. It's about 1 item per second so when I wake up in 6 hours they should be done; getting to the movies one after that
00:09 πŸ”— tephra_ the script can be grabbed here: https://gist.github.com/f77f094032110a7b51e7.git
00:10 πŸ”— tephra_ no wait let me change the name of that first
00:11 πŸ”— tephra_ done
00:11 πŸ”— tephra_ just run with `python ia-colletion-size.py <collection-name>`
00:12 πŸ”— tephra_ i'm doing ephemera, computermagazines and softwarelibrary
00:43 πŸ”— bzc6p__ (~bzc6p@[redacted]) has joined #internetarchive.bak
00:45 πŸ”— zooko (~user@[redacted]) has joined #internetarchive.bak
00:45 πŸ”— zooko Hi folks! I'm here because fenn told me about http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK and I'm one of the authors of Tahoe-LAFS.
00:47 πŸ”— garyrh (garyrh@[redacted]) has joined #internetarchive.bak
00:49 πŸ”— bzc6p_ has quit (Ping timeout: 600 seconds)
00:52 πŸ”— S[h]O[r]T (omgitsme@[redacted]) has joined #internetarchive.bak
00:56 πŸ”— Rotab has quit (hub.se irc.du.se)
01:17 πŸ”— SketchCow hi zooko
01:22 πŸ”— SketchCow tephra_... I suspect faster methods exist but we will stick with this for now
01:22 πŸ”— SketchCow ia mine does parallel threads
01:25 πŸ”— S[h]O[r]T wiki apparently gives me a php warning when i edit the talk page?
01:25 πŸ”— S[h]O[r]T <b>Warning</b>: file_get_contents(/home/archivet/public_html/extensions/SpamBlacklist/wikimedia_blacklist) [<a href='function.file-get-contents'>function.file-get-contents</a>]: failed to open stream: No such file or directory in <b>/home/archivet/public_html/extensions/SpamBlacklist/SpamBlacklist_body.php</b> on line <b>123</b><br />
02:16 πŸ”— zooko` (~user@[redacted]) has joined #internetarchive.bak
02:26 πŸ”— furry5 (~furry5@[redacted]) has joined #internetarchive.bak
02:26 πŸ”— furry5 Hi guys
02:27 πŸ”— furry5 I need help, and I dunno if this is the right spot to ask
02:27 πŸ”— zooko has quit (Read error: Operation timed out)
02:31 πŸ”— furry5 has quit (Quit: furry5)
02:31 πŸ”— Kacey25 (~QbbMLqNb@[redacted]) has joined #internetarchive.bak
02:32 πŸ”— Kacey25 has quit (Read error: Connection reset by peer)
02:32 πŸ”— chfoo sets mode +s #internetarchive.bak
02:41 πŸ”— aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
02:41 πŸ”— yipdw heh, is that the same zooko behind Zooko's Triangle
03:06 πŸ”— zooko` Yes.
03:06 πŸ”— zooko` is now known as zooko
03:09 πŸ”— Ctrl-S http://i.imgur.com/EyW7Krb.gif
03:19 πŸ”— garyrh has quit (Remote host closed the connection)
03:23 πŸ”— chfoo that gif should go on the wiki page
03:25 πŸ”— Ctrl-S We'd be the hippest dudes in the playground with that radical flaming text
03:42 πŸ”— garyrh (garyrh@[redacted]) has joined #internetarchive.bak
03:46 πŸ”— Ctrl-S to protect against bad actors, including the most recent repo of hashes with each block sent out would mean that it'd be very obvious if corruption or bad actors changed a block
03:47 πŸ”— Ctrl-S even if they had half the nodes with the bad version, it'd be blatantly obvious that more than one "correct" version was out there, and thus someone's playing silly buggers
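The scheme Ctrl-S describes — shipping a manifest of hashes alongside each block so corruption or tampering is immediately detectable — could look roughly like this (a minimal illustration; the manifest shape and block IDs are invented):

```python
import hashlib

def verify_block(block: bytes, expected_sha256: str) -> bool:
    """Return True iff the block matches the hash recorded in the manifest."""
    return hashlib.sha256(block).hexdigest() == expected_sha256

# hypothetical manifest mapping block IDs to their known-good hashes;
# in a real system this would be signed and distributed with each block
manifest = {"block-0001": hashlib.sha256(b"original data").hexdigest()}

assert verify_block(b"original data", manifest["block-0001"])
assert not verify_block(b"tampered data", manifest["block-0001"])
```

Any node holding the manifest can then check any block it receives, so a "bad version" circulating on half the nodes would fail verification everywhere.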
03:48 πŸ”— ivan` I think the biggest problem this will have, assuming you even reach tens of thousands of people with spare hard drives, is people dropping off the network as they lose interest in just weeks/months
03:48 πŸ”— Ctrl-S yes
03:48 πŸ”— Ctrl-S that
03:49 πŸ”— Ctrl-S try collaborating with libraries, particularly national/university ones?
04:17 πŸ”— S[h]O[r]T https://www.kickstarter.com/projects/1753332742/back-up-the-internet?ref=category_popular
04:22 πŸ”— chfoo we can do one step further and use bluray discs like facebook :P
04:31 πŸ”— Ctrl-S What about using microfilm?
04:32 πŸ”— Ctrl-S or ZIP disks
04:43 πŸ”— zooko` (~user@[redacted]) has joined #internetarchive.bak
04:45 πŸ”— zooko has quit (Read error: Operation timed out)
05:09 πŸ”— lhobas has quit (hub.dk efnet.port80.se)
05:09 πŸ”— pikhq has quit (hub.dk irc.homelien.no)
05:09 πŸ”— yipdw has quit (hub.dk irc.homelien.no)
05:17 πŸ”— Kazzy has quit (hub.efnet.us hub.dk)
05:17 πŸ”— Kenshin has quit (hub.efnet.us hub.dk)
05:17 πŸ”— SketchCow has quit (hub.efnet.us hub.dk)
05:17 πŸ”— Void_ has quit (hub.efnet.us hub.dk)
05:17 πŸ”— svchfoo2 has quit (hub.efnet.us hub.dk)
05:42 πŸ”— db48x or just bittorrent
05:46 πŸ”— aschmitz This is basically the expected use-case for Tahoe-LAFS. Not everything has to fit in one place at the same time, storage devices aren't trusted, intended to store a lot of data.
05:56 πŸ”— db48x except that a volunteer can't just look at their disk and see useful data
05:57 πŸ”— aschmitz I suppose if that's a goal, maybe not? (Although there may be a way to do that with LAFS, I have only relatively limited experience with it.)
05:58 πŸ”— aschmitz Not sure how useful that would be, though. If people only mirrored their favorite parts of IA, they could do that already, and there probably would be a lot of redundancy in some areas, and no backups of others.
05:59 πŸ”— aschmitz If you're just mirroring pseudorandom algorithm-assigned chunks, other than a minor curiosity, most users won't get much out of browsing their own local copy of a few [hundred] gigs.
05:59 πŸ”— db48x yea
06:00 πŸ”— db48x the "chunks" have to be somewhat meaningful; it can't split items up into unusable bits
06:00 πŸ”— db48x I think bittorrent is the way to go
06:00 πŸ”— db48x we already have a torrent for each item, and it solves the tampering problem
06:01 πŸ”— db48x you can read everything but if you mess with it your pieces no longer verify
06:01 πŸ”— db48x could distribute a custom bittorrent client that automates the process of deciding which users join which swarms
06:02 πŸ”— aschmitz Your pieces might not verify then, but unless you have some process randomly requesting chunks from clients on a continuous basis, you can't prove they haven't just stored the hashes and not the data.
06:02 πŸ”— aschmitz (Which would be a jerky thing to do, but if we're talking about bad actors...)
06:03 πŸ”— db48x naturally
06:03 πŸ”— db48x the same is true of any other solution
06:03 πŸ”— aschmitz Well.
06:04 πŸ”— aschmitz If you were willing to write your own software, you could have the clients take a challenge, HMAC their stored chunk with that challenge, and return the hash. At least then you can prove that they had the data if you have it yourself, and you don't have to transfer as much data.
06:05 πŸ”— aschmitz I'm not aware of any software set up for storing data remotely that does that already, but that doesn't mean it doesn't exist, I suppose. Certainly wouldn't be all that hard to write, either.
06:05 πŸ”— db48x hmm
06:05 πŸ”— db48x that would probably work
06:05 πŸ”— db48x peers could do the verification themselves even
06:06 πŸ”— aschmitz True, although ferreting out bad peers in that system might almost be more work than it's worth; assuming IA scrubs their datastores on a semi-regular basis, they'd be doing the reads anyway.
06:06 πŸ”— db48x yea
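The challenge-response idea aschmitz outlines — send a fresh challenge, have the holder HMAC their stored chunk with it, and compare against your own copy — can be sketched in a few lines (function names and the protocol framing are made up for illustration):

```python
import hashlib
import hmac
import os

def make_challenge() -> bytes:
    # fresh random nonce, so responses can't be precomputed or replayed
    return os.urandom(32)

def prove_possession(chunk: bytes, challenge: bytes) -> str:
    # the holder must have the full chunk bytes to compute this;
    # storing only a hash of the chunk is not enough
    return hmac.new(challenge, chunk, hashlib.sha256).hexdigest()

def verify(own_copy: bytes, challenge: bytes, response: str) -> bool:
    expected = prove_possession(own_copy, challenge)
    return hmac.compare_digest(expected, response)

chunk = b"some archived bytes"
c = make_challenge()
assert verify(chunk, c, prove_possession(chunk, c))
# a peer that kept only metadata about the chunk cannot answer
assert not verify(chunk, c, prove_possession(b"just the hash, not the data", c))
```

The verifier only ships a 32-byte nonce and receives a 64-character digest back, so the bandwidth cost is negligible compared to re-transferring the chunk.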
06:07 πŸ”— aschmitz Anyway, I have to sleep. Good luck, everyone.
06:08 πŸ”— aschmitz (And to chfoo's comment, I have a vague project to automate BD-Rs as bulk storage/medium-term archival media. Not sure how that'll go.)
06:18 πŸ”— zooko` has quit (Remote host closed the connection)
06:34 πŸ”— bzc6p__ has quit (bzc6p__)
07:15 πŸ”— Kazzy (~Kaz@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— Rotab (~Rotab@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— SketchCow (~jscott@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— Void_ (~Void@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— hub.se gives channel operator status to SketchCow
07:15 πŸ”— hub.se gives channel operator status to pikhq yipdw Kazzy svchfoo2
07:15 πŸ”— lhobas (sid41114@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— pikhq (~pikhq@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
07:15 πŸ”— yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
09:37 πŸ”— bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
11:54 πŸ”— bzc6p has quit (bzc6p)
12:23 πŸ”— SketchCow Back in SF.
12:25 πŸ”— SketchCow Thanks for the information so far, tephra_ - I've gone ahead and made it into a table. I am suspecting that we can improve that script to run much faster, but for now, we're getting the information I am hoping for.
12:25 πŸ”— SketchCow For example, the Ephemera films collection is very manageable, 10tb
12:33 πŸ”— tephra_ yes, no doubt the script can be made better :) almost done with the software library; had to do a restart on that for some reason. Also started with a big collection, newsandpublicaffairs, which has 103819 items
12:40 πŸ”— SketchCow Just add a few more to the table.
12:40 πŸ”— SketchCow This is very helpful stuff.
12:47 πŸ”— SketchCow So, several things, as I've been thinking about this on the truck.
12:47 πŸ”— SketchCow - It needs registration. It NEEDS it. Tie it to the archive.org library card system, so people can be reached.
12:48 πŸ”— SketchCow - Initially work with file directories connected to the net directly (no cold storage)
12:48 πŸ”— SketchCow - Have this system so that if chunks are lost, they drop out
12:48 πŸ”— SketchCow - Lower latency for this, therefore - checking once a month.
12:49 πŸ”— SketchCow Initially, we want to talk to people who have filesystems that are idle and have a lot of space on them, so they might have a few TB around, and this is filled with chunks.
12:49 πŸ”— SketchCow And they just delete chunks when they need the space back.
12:49 πŸ”— SketchCow This will work with a LOT of idle filesystems that don't do much.
12:50 πŸ”— SketchCow Over time, we can expand it out to cold storage and other such items, which will make life more difficult but make more copies.
12:51 πŸ”— SketchCow So, I'm using a different method than you, tephra, but....
12:52 πŸ”— SketchCow Items that are simply mediatype "movies" (and this includes dark ones just because I didn't run the query right) result in 893,057 items, and a total of 741 terabytes.
12:54 πŸ”— tephra_ hmmm according to IA there should be 1,888,887 items with mediatype movies
12:55 πŸ”— SketchCow Well, don't fret, I'm using a different method than you that queries the database differently.
12:55 πŸ”— tephra_ ah ok
12:56 πŸ”— SketchCow In fact, it's 512,308 items according to this (not dark)
12:56 πŸ”— SketchCow Computermagazines had the same issue - my set was MUCH smaller than yours.
12:57 πŸ”— SketchCow Jake, who knows all things collection-searching, will be helpful to understanding what's going on.
12:58 πŸ”— tephra_ the script now just does a search for "collection:<name>" using the IA api which seems to be exactly the same as using the search on the site
12:59 πŸ”— tephra_ so it should be 13065 items in the computermagazines collection but yeah it easy to rerun if needed and bugs are found
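The approach tephra_ describes — search the IA API for `collection:<name>` and total up the item sizes — reduces to a small summation once the search results are in hand. A sketch of that core step (the field name `item_size` and the toy records are assumptions; real records would come from the IA search API):

```python
def total_collection_size(items) -> int:
    """Sum the per-item size field (bytes) over a set of search results.

    In practice `items` would be the records returned by an IA API query
    for "collection:<name>"; here it is any iterable of metadata dicts.
    Items missing a size field are counted as zero.
    """
    return sum(int(item.get("item_size", 0)) for item in items)

# toy records standing in for real search results
results = [{"item_size": 1000}, {"item_size": 2500}, {}]
assert total_collection_size(results) == 3500
```

At roughly one metadata fetch per second, as mentioned above, the walk over a large collection dominates the runtime, which is why parallel fetching would speed it up.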
13:03 πŸ”— SketchCow I am going to go get a little more sleep
13:03 πŸ”— SketchCow But I am pleased because the problems are getting nailed down VERY quickly.
13:03 πŸ”— SketchCow I can provide some good space for testing, as I hope can others.
13:06 πŸ”— SketchCow Just checked, movies goes down to 574tb in my method (no darks)
13:09 πŸ”— tephra_ how many items?
13:13 πŸ”— closure SketchCow: I think you're on to something with this being an unused space filler. Need space for something else? IA not currently on fire? Delete at will..
13:14 πŸ”— closure or even better, if it auto-deleted to keep X% of a drive available
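closure's auto-delete idea — treat the chunks as expendable filler and prune them whenever free space on the drive dips below some threshold — could be sketched like this (a hypothetical filler daemon; the flat chunk directory and 10% default are invented):

```python
import os
import shutil

def prune_chunks(chunk_dir: str, keep_free_fraction: float = 0.10) -> list:
    """Delete chunk files, oldest first, until at least `keep_free_fraction`
    of the filesystem holding `chunk_dir` is free again.

    Chunks are safe to drop: IA still has the originals, and the tracker
    would simply notice the copies are gone and hand them out elsewhere.
    """
    removed = []
    target_free = shutil.disk_usage(chunk_dir).total * keep_free_fraction
    # oldest chunks first, so recently fetched data survives longest
    chunks = sorted(
        (os.path.join(chunk_dir, name) for name in os.listdir(chunk_dir)),
        key=os.path.getmtime,
    )
    for path in chunks:
        if shutil.disk_usage(chunk_dir).free >= target_free:
            break
        os.remove(path)
        removed.append(path)
    return removed
```

Run periodically (or before any large local write), this keeps the drive usable for its owner while donating whatever space happens to be idle.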
13:25 πŸ”— closure can someone find out the total number of items in IA?
15:15 πŸ”— SketchCow We're going to do a census.
15:15 πŸ”— SketchCow Bear in mind it was never quite designed that way, so it'll be "fun"
15:15 πŸ”— SketchCow But I got news for you, it will be well over a million.
15:15 πŸ”— fenn what is an "item" anyway?
15:16 πŸ”— fenn a logical β€œthing” that we present on one web page
15:17 πŸ”— fenn there are lots of items that are like, "collection number 12345 of 67890" and themselves are collections of documents
15:18 πŸ”— tephra_ fenn: http://blog.archive.org/2011/03/31/how-archive-org-items-are-structured/
15:18 πŸ”— fenn yes i am on that page
15:19 πŸ”— fenn would your census count things in the wayback machine? i think the wayback is probably what most people think of when they hear "internet archive"
15:19 πŸ”— Start has quit (Disconnected.)
15:20 πŸ”— fenn also, letting people pick what they want to store is stupid; they could just download it themselves if they wanted to do that
15:23 πŸ”— SketchCow 24,598,934 items.
15:23 πŸ”— SketchCow No, people do not pick what they store.
15:23 πŸ”— SketchCow This will be accompanied with tools and information on downloading what you want, which we should do anyway.
15:26 πŸ”— fenn bittorrent and DHT provides a pretty good way of letting people download stuff they are interested in, but also share it via a discoverable and systematic way (no herky jerky "go to my websites at ..." type sharing)
15:28 πŸ”— fenn i like "a custom bittorrent client that automates the process of deciding which users join which swarms"
15:28 πŸ”— fenn but i don't know enough about tahoe-lafs to compare it
15:28 πŸ”— tephra_ SketchCow: got very small numbers for the software library but put them up
15:33 πŸ”— fenn git annex seems more useful for filesystems that change a lot, and you need to preserve the history of those changes. i don't think archival items will be changing very much, if at all.
15:45 πŸ”— fenn it might be possible to do both tahoe-lafs and bittorrent/DHT at the same time, since they are operating on the same data
15:48 πŸ”— SketchCow tephra_: Those numbers aren't actually small
16:04 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
16:12 πŸ”— bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
16:51 πŸ”— Start has quit (Disconnected.)
17:47 πŸ”— tephra_ SketchCow: good
18:02 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
18:26 πŸ”— SketchCow We've never had to do this before in this way, so this will be an interesting census.
18:38 πŸ”— antomatic (~antomatic@[redacted]) has joined #internetarchive.bak
18:41 πŸ”— Start has quit (Disconnected.)
20:01 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
20:24 πŸ”— aschmitz has quit (Read error: Operation timed out)
20:28 πŸ”— Start has quit (Disconnected.)
20:31 πŸ”— aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
20:56 πŸ”— Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
20:58 πŸ”— Sanqui "Hey, what about..."
20:58 πŸ”— Sanqui it would be REALLY cool if I could *choose* what I want to back up locally
20:59 πŸ”— Sanqui like, "boats really interest me, so I want my backups to contain web sites about boats, books about boats, documentaries about boats, etc"
21:28 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
21:29 πŸ”— Start has quit (Read error: Connection reset by peer)
21:46 πŸ”— antomatic Definitely - people are more likely to care and take care of someone else's data if it means something to them too
21:46 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
21:47 πŸ”— yipdw FYI
21:47 πŸ”— yipdw <SketchCow> 24,598,934 items.
21:47 πŸ”— yipdw <SketchCow> No, people do not pick what they store.
21:47 πŸ”— yipdw <fenn> also, letting people pick what they want to store is stupid; they could just download it themselves if they wanted to do that
21:47 πŸ”— antomatic shrugs
21:48 πŸ”— yipdw disallowing choice also makes it easier to get good redundancy
21:49 πŸ”— yipdw it's also one less question you need to ask someone
21:50 πŸ”— antomatic true, and solid reasoning, of course. It just removes a hook that might otherwise really sell the project to people
21:50 πŸ”— antomatic e.g. You can help mirror IA's Secret UFO Files.. etc.
21:51 πŸ”— antomatic If there are enough people who would queue up to store chunks of arbitrary data [and of course, hopefully there will be] then no problem
21:52 πŸ”— antomatic "I'm helping IA store the entire works of Miguel Esperanto" [Like] [Retweet] etc etc etc
21:53 πŸ”— antomatic Specificity (although a nuisance from a management point of view) does also make it more attractive to the masses
21:56 πŸ”— yipdw mass involvement isn't an initial goal (it's easier; also SketchCow wants IA registration to ease contact)
21:56 πŸ”— bzc6p Seriously, how many of us really care what they archive when running a Warrior? This could be the same for storing archives – there would also be people who don't really care. And if so, the freedom of choice could be given.
21:56 πŸ”— yipdw additionally choice isn't required for massive popularity; see e.g. SETI@home
21:56 πŸ”— yipdw run something like http://statusboard.archive.org/ and you have a fun UI
21:56 πŸ”— yipdw er, run something like that locally
21:57 πŸ”— antomatic Mm, I do see the point. I'd disagree that Warrior runners don't care what they work on, though.
21:57 πŸ”— antomatic SETI is interesting _because_ it is doing something you're interested in - e.g. UFOs, Aliens, and all that whizz
21:58 πŸ”— antomatic "Run a Warrior and archive indiscriminate things" is one thing
21:58 πŸ”— antomatic "Run a Warrior and help save DeviantArt" is a whole nother thing.
21:58 πŸ”— bzc6p Well, it may be only me
21:58 πŸ”— antomatic The second one immediately makes people care, if DA is their thing.
21:58 πŸ”— antomatic or if they recognise its significance
21:59 πŸ”— bzc6p But here's a logical approach:
21:59 πŸ”— bzc6p say, it's a selection-based system. Someone wants to store a topic, but everything's already "out". Then he says: "Yuck, I don't store any of the other shit."
22:00 πŸ”— bzc6p Now say there is a random handout scheme.
22:00 πŸ”— bzc6p Would the same person store 500gb chunks of random shit?
22:00 πŸ”— bzc6p erm... content
22:01 πŸ”— antomatic And also the risk that some people might want to help but don't, or *absolutely can not* risk possession of immoral or illegal shit
22:01 πŸ”— antomatic 'No, it's an archive' doesn't cut it if your PC gets taken away by the cops
22:01 πŸ”— bzc6p So, for some people it may not matter, for some others it may, and the selection-based one is – in my logic – better for more.
22:01 πŸ”— antomatic So immediately you already can't be non-specific. It has to be "help save the IA, but not all of it"
22:02 πŸ”— yipdw for a first cut it's going to be people who accept that
22:02 πŸ”— antomatic true
22:02 πŸ”— yipdw you're talking about iteration 17 worries
22:02 πŸ”— yipdw this is iteration 0
22:02 πŸ”— antomatic that's fair
22:02 πŸ”— antomatic it needs the firm foundations of early adopters who are up for anything - agreed
22:04 πŸ”— bzc6p Are we so optimistic that we so quickly reach iteration 1?
22:04 πŸ”— bzc6p I'd worry also about the first 10 PB.
22:05 πŸ”— antomatic This is a bit like the discussions we had in #huntinggrounds
22:05 πŸ”— antomatic as regards large storage, etc
22:05 πŸ”— bzc6p I wasn't there, but I guess it's a *bit* more amount.
22:05 πŸ”— antomatic I wonder what the 'core' amount is
22:05 πŸ”— antomatic - i.e. originals, not derived
22:06 πŸ”— antomatic - books and documents (women and children) first,
22:06 πŸ”— antomatic etc
22:06 πŸ”— antomatic is wayback data more or less important than video
22:06 πŸ”— antomatic etc etc etc
22:06 πŸ”— antomatic priorities, etc
22:06 πŸ”— antomatic but obviously "do everything" makes it easier
22:07 πŸ”— antomatic and is a better approach when the resources permit
22:07 πŸ”— antomatic wayback probably would be priority #1 in terms of being irreplaceable
22:08 πŸ”— antomatic (there is at least a chance that other copies of books & documents exist elsewhere; a situation which doesn't apply to wayback)
22:08 πŸ”— bzc6p antomatic, I have bad news, which you may also know
22:09 πŸ”— bzc6p Wayback is 10 PB itself
22:09 πŸ”— bzc6p However, maybe it could be deduped a bit
22:09 πŸ”— antomatic maybe that is how you sell it. "Help backup the Wayback machine" - that puts a face on the indiscriminate data and is at least something that people would understand and get behind in decent numbers
22:09 πŸ”— antomatic bzc6p: Eek, lots of data. :)
22:10 πŸ”— bzc6p Hm. There may be some desync on the wiki
22:10 πŸ”— antomatic Bundle the software with a screensaver that shows random old web pages :)
22:10 πŸ”— bzc6p SketchCow mentioned 20 PB data with derivatives
22:11 πŸ”— bzc6p Wiki says 18.5 PB unique data
22:11 πŸ”— bzc6p and 50 PB total
22:13 πŸ”— bzc6p nm
22:14 πŸ”— bzc6p maybe the use of the word "unique" confused me
22:14 πŸ”— bzc6p I think derivatives are not unique
22:14 πŸ”— antomatic derivatives can always be recreated
22:15 πŸ”— antomatic assuming continued possession of the original
22:15 πŸ”— bzc6p Still, many petabytes.
22:16 πŸ”— antomatic mmm
22:16 πŸ”— bzc6p But I think it's already been discussed here or there, so now I shut up.
22:17 πŸ”— antomatic me too :)
22:19 πŸ”— Start has quit (Disconnected.)
22:23 πŸ”— Sanqui I wouldn't mind my data being 50% stuff with my chosen keywords, and 50% "Internet Archive's choice" (like, the stuff that needs redundancy right now)
22:27 πŸ”— Ctrl-S important things could be used as filler between stuff the user likes
22:42 πŸ”— closure SketchCow: here's a full worked proposed design for using git-annex for this. Deals with scalability.
22:42 πŸ”— closure http://git-annex.branchable.com/design/iabackup/
22:52 πŸ”— DFJustin ooh looks nice
23:01 πŸ”— yipdw sweet
23:09 πŸ”— closure Result of thinking while driving and hiking all day today :)
23:21 πŸ”— Start (~Start@[redacted]) has joined #internetarchive.bak
23:22 πŸ”— svchfoo1 gives channel operator status to Start
