#internetarchive.bak 2020-07-05,Sun

Time Nickname Message
04:04 🔗 Somebody2 Ooh, the https://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/iabak-sharp_implementation looks neat!
12:15 🔗 HP_Archiv has joined #internetarchive.bak
13:32 🔗 HP_Archiv has quit IRC (Quit: Leaving)
13:36 🔗 HP_Archiv has joined #internetarchive.bak
13:47 🔗 HP_Archiv has quit IRC (Quit: Leaving)
14:12 🔗 Kaz sets mode: +o Somebody2
14:13 🔗 Kaz oh looks neat
14:13 🔗 Kaz who's responsible for that?
14:40 🔗 sirvy I made it. Feel free to try it/suggest improvements/ask questions
14:48 🔗 kiska I think the thing with git annex is it was written for this use case, I remember Jason saying something about this
14:50 🔗 kiska I believe the very first thing we have to do is take a census of what data is at the IA. And then see what we can do about it, given there is like ~60PB of data at the IA (I am pulling a number out of my arse)
14:51 🔗 kiskaWee has joined #internetarchive.bak
14:51 🔗 kiska sets mode: +o kiskaWee
14:59 🔗 sirvy The "2018.03-ia_identifiers" census lists 33M items. A few days ago I scraped the "_files.xml" of 100,000 items (so for each item, I have the file list, their sizes and whether they're private or not)
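
[Editor's sketch, not part of the log] The per-item summary sirvy describes (file list, sizes, private flag) could be pulled out of a "_files.xml" roughly as below. This assumes the file is a <files> root with <file name="..."> children that carry a <size> element and, for restricted files, a <private>true</private> element; that layout is an assumption here, and the sample XML and the summarize_files helper are made up for illustration, not sirvy's actual scraper.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the ASSUMED _files.xml layout (not real IA data).
SAMPLE = """<files>
  <file name="crawl.warc.gz" source="original">
    <size>1000</size>
    <private>true</private>
  </file>
  <file name="notes.txt" source="original">
    <size>250</size>
  </file>
  <file name="example_meta.xml" source="metadata"/>
</files>"""


def summarize_files(xml_text):
    """Return a (name, size_in_bytes_or_None, is_private) tuple per <file>."""
    root = ET.fromstring(xml_text)
    rows = []
    for f in root.iter("file"):
        # Some entries may have no <size>; record None instead of guessing.
        size_text = f.findtext("size")
        size = int(size_text) if size_text is not None else None
        # Treat a missing <private> element as "not private" (assumption).
        is_private = (f.findtext("private") or "").strip().lower() == "true"
        rows.append((f.get("name"), size, is_private))
    return rows


rows = summarize_files(SAMPLE)
# Bytes that the general public could actually download from this item.
public_bytes = sum(size or 0 for _, size, private in rows if not private)
```

Summing only the non-private sizes gives a rough idea of how much of an item volunteers could mirror today.
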
15:01 🔗 sirvy Unfortunately, most of the web crawls made by IA (as opposed to those made by users or by ArchiveTeam) are unavailable for download by the general public
15:08 🔗 sirvy Ideally in the future, it would be great if IA could provide at least an encrypted version of them. Volunteers could help with storage space, and a few trusted parties (e.g. IA) would retain the keys (a separate one for each item, so that individual items can be made public if/when IA considers it appropriate to do so)
15:17 🔗 sirvy @kiska Fortunately, the size distribution is very wide among items. Size (smaller), age (oldest), uniqueness (web crawls) can be used to prioritize which items to back up first. I'm pretty sure most people would consider a petabyte of the web from the 90s to be more historically valuable than a petabyte from the Google+ scrape...
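
[Editor's sketch, not part of the log] The three criteria sirvy lists (uniqueness, age, size) can be combined into a simple sort key. The order of precedence used here (web crawls first, then oldest, then smallest) is an assumption for illustration; the log names the criteria but not how to weigh them. The Item fields and the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    identifier: str     # hypothetical item identifier
    size_bytes: int
    year: int           # year the content dates from
    is_web_crawl: bool  # stand-in for "uniqueness"


def backup_order(items):
    """Sort items: web crawls first, then oldest, then smallest.

    False sorts before True, so `not it.is_web_crawl` puts crawls first.
    """
    return sorted(
        items,
        key=lambda it: (not it.is_web_crawl, it.year, it.size_bytes),
    )


items = [
    Item("example-gplus-dump", 5_000_000, 2019, True),
    Item("example-90s-crawl", 2_000_000, 1997, True),
    Item("example-scan", 1_000, 2005, False),
    Item("example-90s-crawl-small", 10_000, 1997, True),
]
order = [it.identifier for it in backup_order(items)]
```

A weighted score would allow trade-offs between the criteria (for example, a tiny recent item ahead of a huge old one); the lexicographic key above is just the simplest variant.
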
19:16 🔗 Pixi` has joined #internetarchive.bak
19:22 🔗 atphoenix has quit IRC (Read error: Operation timed out)
19:22 🔗 atphoenix has joined #internetarchive.bak
19:25 🔗 Pixi has quit IRC (Read error: Operation timed out)
