#internetarchive.bak 2015-03-06,Fri


Time Nickname Message
00:27 🔗 enkiv2 has quit (Ping timeout: 606 seconds)
00:38 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
00:39 🔗 svchfoo1 gives channel operator status to Start
00:54 🔗 enkiv2 (~john@[redacted]) has joined #internetarchive.bak
01:56 🔗 jake1 has quit (Read error: Operation timed out)
02:00 🔗 Start has quit (Read error: Connection reset by peer)
02:00 🔗 Start_ (~Start@[redacted]) has joined #internetarchive.bak
02:00 🔗 Start_ is now known as Start
02:01 🔗 svchfoo1 gives channel operator status to Start
02:32 🔗 kaizoku (~kaizoku@[redacted]) has joined #internetarchive.bak
02:47 🔗 DFJustin has quit (Ping timeout: 258 seconds)
02:47 🔗 DFJustin (~justin@[redacted]) has joined #internetarchive.bak
02:50 🔗 DFJustin has quit (Client Quit)
02:50 🔗 DopefishJ (DopefishJu@[redacted]) has joined #internetarchive.bak
02:50 🔗 yhager has quit (Ping timeout: 258 seconds)
02:50 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
02:50 🔗 DopefishJ is now known as DFJustin
02:50 🔗 svchfoo2 gives channel operator status to DFJustin
02:55 🔗 GauntletW has quit (Read error: Operation timed out)
02:56 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
02:57 🔗 yhager has quit (Ping timeout: 258 seconds)
02:57 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:01 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:01 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
03:07 🔗 GauntletW (~ted@[redacted]) has joined #internetarchive.bak
03:07 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:18 🔗 GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:18 🔗 yhager has quit (hub.efnet.us irc.Prison.NET)
03:59 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
05:05 🔗 joeyh runs stats on prelinger
05:12 🔗 joeyh SketchCow: while I think torrents have a lot going for them in simplicity, git-annex (or ipfs) seems better to me at attracting users.
05:13 🔗 joeyh Most of the big collections have a manageable number of items in them (10-100k). And unlike torrents, the others allow adding new items
05:14 🔗 joeyh if you like GD, or computer mags, or whatever, getting new ones automatically is pretty rad
05:16 🔗 joeyh btw, doesn't the IA have download stations for drop-in visitors? I really must get out there physically one day
05:17 🔗 DFJustin bittorrent sync allows adding new files
05:24 🔗 trs80 is bt sync open source though?
05:24 🔗 Ctrl-S AFAIK no
05:25 🔗 trs80 http://syncthing.net/ might be an alternative that is
05:28 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
05:29 🔗 joeyh ipfs is rather similar to bittorrent sync, more decentralized, but many of the same technologies
05:42 🔗 SketchCow OK, so.
05:42 🔗 SketchCow The hackernews drop caused a lot of people to come at me.
05:42 🔗 SketchCow Some are taking it wayyyyyy too seriously, and some consider it to be official IA.
05:43 🔗 SketchCow Translation: We're heading along anyway, and I am favoring git-annex, but people will try to emotionally/logically blackmail me into other solutions.
05:46 🔗 pikhq Because random pet project is clearly superior.
05:51 🔗 SketchCow Well, THIS is the random pet project.
05:51 🔗 SketchCow It might become more, but for now, I want to work with joeyh on this and everyone else too.
05:51 🔗 joeyh SketchCow: I like the idea of just letting people implement demo systems handling one standard starter dataset, like prelinger, and evaluate
05:51 🔗 joeyh if more than 1 group wants to
05:51 🔗 SketchCow I want to progress with you, as we find The Problems
05:52 🔗 SketchCow And make sure the wiki shows The Problems
05:52 🔗 SketchCow And also to see if this reveals Problems within IA's own infrastructure
05:53 🔗 SketchCow So, first, we ALL agree. The Census.
05:53 🔗 SketchCow Gotta know what's being backed up.
05:54 🔗 SketchCow So, in that way, we know: 14,926,080 items.
05:54 🔗 SketchCow Different from the 24 million we had. That 14,926,080 is the number of items that are public, indexed, and downloadable.
05:54 🔗 joeyh are we going to get a file count per item?
05:54 🔗 SketchCow Yes, he's building a massive list of everything.
05:55 🔗 SketchCow This already betrayed bugs and issues in his reporter, so he's taking a little bit of time.
05:55 🔗 joeyh that's an interesting delta btw :)
05:55 🔗 SketchCow So this is already paying dividends.
05:55 🔗 SketchCow You mean from 24 million?
05:55 🔗 joeyh yeah
05:55 🔗 SketchCow Well, some are not indexed. Some are dark, and some are system items.
05:55 🔗 joeyh is wayback machine data in this?
05:55 🔗 SketchCow Spam will be dark, for example, and we get a lot of spam.
05:55 🔗 SketchCow I don't know.
05:57 🔗 xmc and there are items which are visible but not downloadable, like the not-public-domain texts
05:59 🔗 SketchCow Alright, out of the 14,926,080 indexed items I dumped from the metadata table on 2015-03-04T20:53:57, I was able to successfully scan through 14,921,581 items (I'm still sorting out the issues with the remaining 4,508).
05:59 🔗 SketchCow Out of those 14 million or so items, all of the non-derivative files add up to 14225047435566359 bytes.
05:59 🔗 SketchCow 14.23 petabytes.
05:59 🔗 SketchCow See? Nothing.
05:59 🔗 SketchCow goes down to Best Buy
06:00 🔗 xmc i got a hundred bucks, should cover my share
06:00 🔗 joeyh adds a new plan: wait 10 years and go to best buy
06:06 🔗 SketchCow So, Jake tells me he is compressing the JSON collection of information on the files in the 14,920,000 items so that it can be downloaded and analyzed.
06:06 🔗 joeyh so, I'm running a simulation with git-annex and dummy data, 10k files, 100 clients, just to get some real numbers about how big the git repo grows when git-annex is tracking all those clients' activity
06:07 🔗 S[h]O[r]T has quit (Read error: Operation timed out)
06:08 🔗 SketchCow He verifies it takes about 10 hours to generate The List.
06:12 🔗 joeyh looks like the git repo will be 17mb after all 100 clients download a random ~300 files each and report back about the files they have
06:21 🔗 joeyh let's see how much it will grow if the clients all report back once a month to confirm they still have data..
06:22 🔗 joeyh 1 mb per month!
06:22 🔗 joeyh or less, I didn't get exact numbers. But that's great news
06:22 🔗 joeyh yay for git's delta compression, it's so awesome
06:23 🔗 SketchCow joeyh: What's a good e-mail address for you?
06:23 🔗 SketchCow Or mail jscott@archive.org if you don't want these maniacs having it
06:24 🔗 joeyh id@joeyh.name
06:24 🔗 joeyh so, we can run for years, with clients reporting back every month, and get a git repo under 100 mb
06:24 🔗 joeyh and it will hold the full history of where every file was on every client, every month
06:25 🔗 joeyh we can probably handle repos with 10x as many files, given these numbers..
06:26 🔗 joeyh or, scale to 1000 clients
06:26 🔗 joeyh per shard, that is
06:26 🔗 xmc cooool
06:27 🔗 joeyh with thousands of shards, we could have a million+ different drives involved in this, and it seems it would scale ok, at least as far as the tracking overhead
06:29 🔗 joeyh (also, "git-annex forget" can drop the old historical data, if it did become too large)
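
Back-of-the-envelope arithmetic behind those figures, using only the numbers joeyh gives above:

    1,000 clients per shard x ~1,000 shards  ->  1,000,000+ drives
    < 100 MB of git tracking data per shard over years of monthly
    check-ins, so thousands of shards cost on the order of a few
    hundred GB of tracking metadata in total
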
06:30 🔗 joeyh will write up a script to do this simulation reproducibly, but for now, bottom of http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
06:39 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
06:45 🔗 bzc6p has quit (Ping timeout: 600 seconds)
06:49 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
06:54 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
06:54 🔗 svchfoo1 gives channel operator status to db48x
06:57 🔗 tychotith (~tychotith@[redacted]) has joined #internetarchive.bak
07:09 🔗 db48x has quit (Read error: Connection reset by peer)
07:10 🔗 db48x2 (~user@[redacted]) has joined #internetarchive.bak
07:11 🔗 db48x2 is now known as db48x-the
07:11 🔗 db48x-the is now known as db48x2
07:15 🔗 SketchCow http://mamedev.org/downloader.php?file=releases/mame01.zip
07:15 🔗 SketchCow Wait
07:15 🔗 SketchCow https://archive.org/details/ia-bak-census_20150304
07:15 🔗 SketchCow joeyh: There you go
07:15 🔗 joeyh nice
07:15 🔗 joeyh but someone else will need to work on census stuff, I'm off to write a roguelike in 7 days
07:16 🔗 joeyh 24x7 coding babyee
07:16 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
07:17 🔗 svchfoo2 gives channel operator status to db48x
07:19 🔗 db48x2 has quit (Quit: brb)
07:19 🔗 db48x has quit (Quit: ERC Version 5.3 (IRC client for Emacs))
07:20 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
07:20 🔗 joeyh wow 8 gb of json
07:21 🔗 ersi wow 8gb of jason
07:22 🔗 svchfoo1 gives channel operator status to db48x
07:22 🔗 svchfoo2 gives channel operator status to db48x
07:35 🔗 joeyh wants to know how many total filenames are listed in that json
07:35 🔗 db48x jsawk?
07:39 🔗 joeyh here's the script I'm using to simulate using git-annex at scale http://tmp.kitenet.net/git-annex-growth-test.sh
07:43 🔗 db48x neat
07:43 🔗 db48x how long does that take to run?
07:43 🔗 joeyh an hour or so
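
For readers who can't reach that URL, here is a minimal sketch of the same kind of growth test. This is not joeyh's actual script; the file counts, names, and layout are illustrative:

    #!/bin/sh
    # Simulate many clients claiming random files from one git-annex
    # repo, then measure how large the shared tracking data grows.
    set -e
    mkdir sim && cd sim
    git init origin
    (cd origin
     git annex init origin
     # seed the repo with dummy annexed files
     for i in $(seq 1 10000); do echo "$i" > "file$i"; done
     git annex add . && git commit -m seed)
    for c in $(seq 1 100); do
        git clone origin "client$c"
        (cd "client$c"
         git annex init "client$c"
         # each client grabs ~300 random files and reports back
         git annex get $(ls file* | shuf -n 300)
         git push origin git-annex)
    done
    du -sh origin/.git    # size of the accumulated tracking history
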
07:49 🔗 espes___ (~espes@[redacted]) has joined #internetarchive.bak
07:50 🔗 espes___ *KNEE-JERK SKEPTICISM*
07:52 🔗 db48x what's the growth look like?
07:53 🔗 db48x oh, you put it in the wiki :)
07:55 🔗 db48x amazing how this looks doable, but Valhalla didn't
07:56 🔗 espes___ but I will just point out that 20PB in 1 year is a quarter of IA's network capacity, continuously :P
07:57 🔗 joeyh of course we have no idea if enough people will join, or how long it will take to get enough
08:11 🔗 db48x is there an easy way to check which version of git annex I have installed?
08:12 🔗 joeyh git annex version
08:12 🔗 joeyh and that script needs a fairly new one, btw
08:12 🔗 db48x ah
08:14 🔗 db48x I was doing git annex --version
08:53 🔗 midas1 is now known as midas
09:45 🔗 X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak
09:55 🔗 bzc6p_ is now known as bzc6p
10:31 🔗 edward_ has quit (Ping timeout: 512 seconds)
11:16 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
12:16 🔗 edward_ has quit (Ping timeout: 512 seconds)
12:22 🔗 S[h]O[r]T (~ShOrT@[redacted]) has joined #internetarchive.bak
12:26 🔗 jake1 has quit (Quit: Leaving.)
12:39 🔗 S[h]O[r]T has quit ()
12:57 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
13:22 🔗 nicoo joeyh: You are git-annex's dev, right?
13:23 🔗 nicoo I was wondering how realistic it would be to try to handle all of IA over git-annex, given that last time I used it, I had noticeable trouble scaling to hundreds of GB
13:54 🔗 VADemon (~VADemon@[redacted]) has joined #internetarchive.bak
13:57 🔗 midas im kinda worried about how we'd handle bitflips from the storage point of view. with a couple hundred nodes holding TBs of data, at some point one drive will flip a bit. how will we checksum this?
13:59 🔗 midas also, will it be stored in containers or dumps of readable data?
14:17 🔗 edward_ has quit (Ping timeout: 512 seconds)
14:40 🔗 joeyh nicoo: if you had difficulty scaling to hundreds of GB, I'd suspect you had many small files.
14:40 🔗 yhager has quit (Read error: Connection reset by peer)
14:40 🔗 joeyh I have personal git-annex repos that are > 10 tb, and the only limit on scaling with large files is total number of files
14:41 🔗 joeyh and total amount of disk space
14:41 🔗 joeyh midas: it needs to checksum everything every so often
14:43 🔗 joeyh checksumming 500gb takes a while, anyone want to run the numbers for different likely types of storage and cpus?
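
One way to run those numbers on a given machine (paths are illustrative): measure hash throughput with no disk involved, measure raw read speed, and divide.

    # CPU-bound SHA-256 throughput; dd reports MB/s on stderr
    dd if=/dev/zero bs=1M count=4096 | sha256sum
    # sequential read speed of the drive in question
    dd if=/mnt/backup/some-big-file of=/dev/null bs=1M
    # e.g. at 100 MB/s, 500 GB is 500000/100/3600, about 1.4 hours per pass
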
14:44 🔗 nicoo joeyh: Lots of FLAC files, sized in the tens of MB. It was a while ago, though, so there might have been improvements
14:44 🔗 joeyh note, we can have clients periodically announce the files they think they still have. They could announce every month, even if it took them a year to checksum
14:45 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
14:45 🔗 joeyh so we can detect clients that drop out reasonably quickly, and the rarer flipped bits less quickly
14:47 🔗 nicoo joeyh: The checksumming shouldn't be mandatory for the client, though. For instance, I operate ZFS pools and run scrubs regularly (basically, checking the block-level checksums), so I know my data wasn't subjected to bit-rot
14:49 🔗 joeyh sure, checksumming is just a way for a client to be sure it knows what it knows. If it has other ways to know, it can just tell us
14:51 🔗 nicoo nods
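
For reference, the kind of block-level verification nicoo is describing, with standard ZFS commands (the pool name is illustrative):

    # re-read every block in the pool and verify its checksum
    zpool scrub tank
    # afterwards, report only pools that have problems
    zpool status -x
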
14:51 🔗 midas oh and the most important thing
14:51 🔗 midas we need a leaderboard
14:53 🔗 bzc6p if we suppose that bad-acting is (kind of) excluded based on trust or any other method that has been discussed here recently,
14:53 🔗 bzc6p we could just ask (or build in) regular checksumming.
14:53 🔗 bzc6p Maybe a not-too-strong one is enough, isn't it? (As it's not against bad acting, but for checking integrity.)
14:53 🔗 joeyh yeah, that's what my git-annex design calls for, and it doesn't have proof of storage
14:54 🔗 bzc6p So we could choose a less computation-intensive one.
14:54 🔗 bzc6p OR
14:54 🔗 bzc6p maybe we could add some kind of ECC (error correction code)
14:55 🔗 Ctrl-S why not both?
14:55 🔗 bzc6p One is computation-intensive, other uses more storage
14:55 🔗 joeyh adding ecc data would be good.. anyone know a tool that can do it alongside the original unmodified file though?
14:57 🔗 bzc6p There must be a lot.
14:57 🔗 bzc6p For example, for optical media there is dvdisaster
14:57 🔗 bzc6p The same method could be applied
14:57 🔗 bzc6p it's open source
15:01 🔗 bzc6p There must be several such tools, anyway.
15:06 🔗 bzc6p thinks that he overestimated the number of such tools
15:08 🔗 joeyh well, find one that works well, and it can be added to the ingestion process, and could be used client-side to recover files that git-annex throws out due to checksum failure
15:11 🔗 edward_ (~edward@[redacted]) has joined #internetarchive.bak
15:12 🔗 joeyh btw, we need to decide which platforms clients run on
15:13 🔗 joeyh git-annex is linux/osx/windows.. I keep finding annoying bugs in the windows port though
15:13 🔗 Ctrl-S Windows 3.1
15:13 🔗 Ctrl-S Have to support older hardware
15:13 🔗 Ctrl-S :P
15:14 🔗 Ctrl-S Can you run it in a VM like the warrior?
15:14 🔗 joeyh a docker image seems more sensible.. because with docker, it can probably access the disk they want to use
15:15 🔗 joeyh also, I think that OSX runs docker images in a linux emulator, which might make it easier. Dunno if windows can do the same yet
15:21 🔗 sep332 docker is pretty linux-specific at this point
15:23 🔗 joeyh hmm, I heard OSX supported it
15:30 🔗 sep332 looks like a wrapper around virtualbox http://docs.docker.com/installation/mac/
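
joeyh's disk-access point, made concrete: unlike a warrior-style VirtualBox VM, a container can bind-mount whatever drive the user dedicates (the image name below is made up for illustration):

    # expose the user's storage drive directly inside the container
    docker run -d -v /mnt/ia-bak:/data ia-bak-client
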
15:32 🔗 Start has quit (Disconnected.)
15:34 🔗 bzc6p As for ECC, I've found Parchive
15:34 🔗 bzc6p which is a "system"
15:34 🔗 bzc6p there is a linux commandline tool par2
15:35 🔗 bzc6p and
15:36 🔗 bzc6p several other software for several operating systems
15:36 🔗 bzc6p (according to Wikipedia)
15:36 🔗 bzc6p I've played a bit with PyPar2 (linux)
15:36 🔗 bzc6p ECC generation, with the default settings (and 15% redundancy) seems to be a bit slow
15:37 🔗 bzc6p but there are several settings
15:37 🔗 bzc6p People here are much more expert than me; further investigation I leave up to you
15:39 🔗 joeyh nono, the way it works is you find something reasonable and you put it in the wiki
15:39 🔗 joeyh then the more expert person puts something better in :)
15:40 🔗 bzc6p I consider myself unworthy to put anything in the corresponding wiki page
15:42 🔗 bzc6p sighs
15:42 🔗 bzc6p okay
15:46 🔗 bzc6p added
15:52 🔗 ivan` http://chuchusoft.com/par2_tbb/ is the optimized par2
15:53 🔗 ivan` I run it with 1%-5% depending on how big the input is
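
For reference, basic par2cmdline usage along the lines discussed here (file names are illustrative; -r sets the redundancy percentage):

    # create 5% recovery data alongside the untouched original files
    par2 create -r5 item.par2 *.flac
    # later, on the client: check for damage
    par2 verify item.par2
    # rebuild corrupted blocks from the recovery data
    par2 repair item.par2
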
16:02 🔗 Sanqui has quit (Quit: .)
16:05 🔗 sep332 par2 would let a client rebuild a small amount of damage locally, without having to re-fetch a whole 500GB block?
16:05 🔗 SketchCow Boop
16:05 🔗 SketchCow ha ha par
16:05 🔗 SketchCow PARRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
16:05 🔗 SketchCow Saves CD-ROM .isos, HD movies, and the Internet Archive.
16:10 🔗 sep332 reed-solomon is my homeboy
16:22 🔗 SketchCow par was always magical to me
16:23 🔗 SketchCow As for adding stuff to the wiki, like bzc6p, the whole POINT is for people to drop a bunch of stuff in there.
16:23 🔗 SketchCow I'm shooting down some stuff for the tests we're running, but other people can run tests and it's always good to have a nice set of information up there.
16:29 🔗 Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
16:35 🔗 sep332 have we talked about deduplication? how does IA even handle that?
16:37 🔗 SketchCow It doesn't.
16:37 🔗 SketchCow https://twitter.com/danieldrucker/status/573884557860143104
16:37 🔗 sep332 ok. i realize block-level would be crazy, but at least each file has a hash right?
16:39 🔗 SketchCow Still loving Par
16:39 🔗 SketchCow Anyway, so, first, I want to see this working prototype.
16:40 🔗 SketchCow And in doing so, we're going to discover all sorts of things.
16:40 🔗 SketchCow One thing is how certain things, like the Prelinger Archive, are not that big!
16:44 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
16:50 🔗 xmc so. scoreboarding.
16:51 🔗 xmc ( bytes * days retained ) / bandwidth used
16:51 🔗 joeyh gunzipping this json and it's already 32 gb.. I wonder how large it will be
16:51 🔗 xmc properly rewards not having to redownload
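
A worked example of that metric: holding 500 GB for 30 days after a single full download scores (500 x 30) / 500 = 30, while fetching the same 500 GB three times over scores (500 x 30) / 1500 = 10, so churn drags the score down.
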
16:51 🔗 xmc zcat|grep :P
16:52 🔗 WubTheCap SketchCow: Query
16:52 🔗 joeyh aha, only 34 gb
16:53 🔗 joeyh starts a stupid grep to count files w/o actually parsing the json properly
16:53 🔗 xmc i love processing structured text with unix tools
16:53 🔗 xmc xml? 'split' and i got this.
16:54 🔗 joeyh that'll take half an hour, according to pv
16:56 🔗 SketchCow jake1 is my co-worker, by the way. He's written all the ia interaction tools, including the python internetarchive
16:56 🔗 SketchCow WubTheCap: what
17:01 🔗 DFJustin par is fucking sorcery
17:23 🔗 yhager has quit (Read error: Connection reset by peer)
17:27 🔗 yhager (~yuval@[redacted]) has joined #internetarchive.bak
17:28 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
17:38 🔗 jake1 has quit (Quit: Leaving.)
17:46 🔗 Start has quit (Disconnected.)
17:49 🔗 espes___ joeyh: `jq`!
17:52 🔗 sep332 espes___: nice!
17:54 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
17:55 🔗 joeyh IA: 271694965 files
17:56 🔗 joeyh so, that's good news
17:56 🔗 joeyh only 1 order of magnitude more files than items
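
For the record, the same count via jq rather than a raw grep, assuming each census record is a JSON object carrying a `files` array (the actual schema isn't shown in this log):

    zcat census.json.gz | jq '.files | length' | awk '{n += $1} END {print n}'
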
18:11 🔗 SketchCow So, one minor note. The census file includes stream_only files
18:11 🔗 SketchCow Which means they can't be downloaded. So I just darked the item for jake to fix.
18:12 🔗 SketchCow See, it's all these little bumps that should be accounted for.
18:12 🔗 SketchCow 271 million original files, joeyh?
18:40 🔗 underscor (~quassel@[redacted]) has joined #internetarchive.bak
18:42 🔗 Start has quit (Disconnected.)
18:50 🔗 jake1 (~Adium@[redacted]) has joined #internetarchive.bak
18:52 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
18:57 🔗 joeyh SketchCow: yes, original files, that's all the census lists, IIRC
18:57 🔗 bzc6p has quit (Ping timeout: 600 seconds)
18:58 🔗 sep332 has quit (Read error: Connection reset by peer)
18:58 🔗 joeyh hmm, in my experience, stream_only files can be downloaded, if you know how
19:08 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
19:16 🔗 db48x yes, you just need to know the url
19:16 🔗 jake2 (~Adium@[redacted]) has joined #internetarchive.bak
19:17 🔗 jake1 has quit (Read error: Operation timed out)
19:24 🔗 DFJustin another wrinkle that will come up at some point: if you upload a torrent to IA, it downloads the files from the torrent but marks them as derivatives, and the .torrent is the only "original" file
19:25 🔗 fenn in what world does that make any sense?
19:25 🔗 DFJustin the whole thing is kind of hacked together
19:26 🔗 fenn processes generating derivative files should never involve network activity
19:31 🔗 WubTheCap has quit (Quit: Leaving)
19:32 🔗 joeyh ouch
19:35 🔗 xmc yeah. that is a big big wart.
19:37 🔗 db48x heh
19:43 🔗 yipdw is that really a big wart? sounds like you can fix that by downloading derivatives for torrents
19:44 🔗 yipdw I mean sure a special case, but whatever
19:44 🔗 joeyh yeah, true.
19:45 🔗 DFJustin well it would be nice to have it changed on the IA side eventually because it also stops them from deriving audio/video/etc files from the torrent
19:45 🔗 yipdw oh yeah definitely
19:45 🔗 yipdw just insofar as backup goes
19:45 🔗 joeyh so, with 271 files, if we wanted to not tar up an Item's files, that would mean increasing the git repo shard size from 10k to 100k, or the number of shards to 24000
19:45 🔗 SketchCow Boop.
19:46 🔗 garyrh Doesn't seem to be that many: https://archive.org/search.php?query=source%3A%28torrent%29
19:46 🔗 SketchCow So, Jake is redoing the census. The numbers will shrink.
19:46 🔗 joeyh somehow, 24000 git repos seems harder to deal with than 2400 of them
19:46 🔗 SketchCow A bit.
19:46 🔗 joeyh er, that's 271 million files of course. 271 would be slightly easier
19:47 🔗 db48x :)
19:47 🔗 joeyh not tarring up an Item's files has some nice features. like, git-annex could be told the regular IA url for the file, and would download it straight from the IA over http
19:47 🔗 joeyh rather than needing to keep the content temporarily on a ssh server
19:48 🔗 joeyh (makes the git repo a bit bigger of course, but probably less than you'd think thanks to delta compression)
19:49 🔗 joeyh 100 thousand files per git repo is manageable, it's just getting sorta close to the unmanageable zone
19:49 🔗 db48x putting 100k items in a single directory would be annoying
19:49 🔗 joeyh well, think $repo/$item/$file
19:49 🔗 joeyh or $repo/$collection/$item/$file
19:50 🔗 db48x better
19:50 🔗 joeyh on balance, I'm inclined toward 100k items in the repo, 2400 repos, and http download right from IA
19:50 🔗 joeyh or, 4800 repos of 50k each
19:51 🔗 joeyh so, I think the next step is to build a list of the url to every file listed in their census
19:52 🔗 SketchCow So, this drill has been VERY helpful for all of us regarding the Census.
19:52 🔗 SketchCow We've ripped out a bunch of items for this class.
19:55 🔗 joeyh (along with the item and collection that the url is part of)
19:56 🔗 joeyh oh yeah, their json has md5 for the files
19:56 🔗 SketchCow Another set just went out
19:56 🔗 joeyh not the greatest checksum. git-annex can use it, but bad actors could always find a checksum collision and use it to overwrite files
19:57 🔗 joeyh but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:58 🔗 joeyh got cut off, what was the last thing I said?
19:59 🔗 garyrh <joeyh> but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:59 🔗 joeyh yeah, that's all
20:25 🔗 joeyh SketchCow: maybe ask your guys if there's any chance they could add a sha256 of every file. Eventually..
20:26 🔗 joeyh notes there are ways to read a file and generate a md5 and sha256 at the same time. So if they periodically check the md5s, they could get the shas almost for free
20:26 🔗 joeyh they = the IA
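
The single-read trick joeyh mentions, as a bash one-liner using process substitution (the file name is illustrative):

    # tee feeds the same bytes to both hashers; the file is read only once
    tee >(md5sum) >(sha256sum) < somefile > /dev/null
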
20:38 🔗 SketchCow joeyh: I flung it at them
20:38 🔗 SketchCow I'll keep restating, but let's barrel forward with the flaws
20:38 🔗 SketchCow And then note the flaws and see if they can be fixed
20:42 🔗 joeyh so, I can write a script that takes a file with lines like "<bytes> <checksum> <collection> <item> <file> <url>" and spits out, quite quickly, a git-annex repository
20:43 🔗 joeyh totally out of time as far as generating such files for now though
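
A hedged sketch of both halves of that pipeline. The census field names used here (identifier, collection, files, size, sha1, name) are guesses rather than a documented schema; the download URL pattern is archive.org's standard one:

    # 1. flatten the census into "<bytes> <checksum> <collection> <item> <file> <url>"
    zcat census.json.gz | jq -r '
      .identifier as $item | .collection as $coll | .files[] |
      [.size, .sha1, $coll, $item, .name,
       "https://archive.org/download/\($item)/\(.name)"] | @tsv' > census.tsv

    # 2. inside a freshly initialized git-annex repo: register each file by
    #    its IA url, so clients later fetch straight from archive.org
    while IFS=$'\t' read -r bytes sum coll item file url; do
        git annex addurl --fast --file "$coll/$item/$file" "$url"
    done < census.tsv
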
20:43 🔗 SketchCow I just keep repeating myself, just because I don't like to see things like this blow up because people go "it's not perfect"
20:43 🔗 joeyh 7drl awaits
20:43 🔗 SketchCow 7drl?
20:43 🔗 joeyh writing a roguelike in 7 days
20:43 🔗 SketchCow From Hank:
20:43 🔗 SketchCow all files do already have sha1's, which are less collision-prone than md5's. would that be adequate?
20:44 🔗 joeyh it would be less inadequate
20:44 🔗 joeyh much less
20:44 🔗 joeyh :)
20:44 🔗 joeyh so yes plz, sha1s
20:44 🔗 SketchCow Would you consider the issue closed, and "make it 256 some time in the future"
20:44 🔗 SketchCow Well, they're there.
20:44 🔗 SketchCow They've been there, they're there.
20:44 🔗 joeyh I think so. practical sha1 attacks have not yet been demonstrated
20:45 🔗 joeyh also, if users can break sha1, they can break **git**
20:45 🔗 joeyh if they break git, and we're using git-annex, we have bigger problems
20:45 🔗 SketchCow Well, then, switch to the sha1
20:45 🔗 joeyh also, we should switch to archiving github, if sha1 is broken, before it melts down into a puddle
20:45 🔗 SketchCow So you're out for the count for a week?
20:46 🔗 joeyh yep, 8 to 16 hour days writing a game
20:46 🔗 joeyh and then I'm in boston for a week, but a little more available
20:46 🔗 SketchCow https://www.schneier.com/blog/archives/2005/02/sha1_broken.html
20:47 🔗 SketchCow :)
20:47 🔗 SketchCow Anyway, tracey says use md5sum AND sha-1
20:47 🔗 joeyh yes, but ... no.
20:47 🔗 joeyh that paper, afaik, has never been published, or the results have not been replicated
20:48 🔗 joeyh or, it wasn't good enough for practical collisions yet, just a reduction from "impossibly hard" to "convert the sun to rackmount computers" hard
20:48 🔗 joeyh A 2011 attack by Marc Stevens can produce hash collisions with a complexity between 2^60.3 and 2^65.3 operations.[1] No actual collisions have yet been produced. -- wiki
20:49 🔗 joeyh that's meant to be 2^60
20:49 🔗 joeyh so, it was 2^69 in 2005, and 2^60 in 2011.. we can see where this is going
20:51 🔗 SketchCow Well, understood, joeyh - of course we're going to keep doing the project but your bit will have to wait until you're back
20:51 🔗 joeyh "estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers"
20:51 🔗 joeyh man, I so want someone to do that
20:51 🔗 yipdw $5 million if you use AWS
20:52 🔗 joeyh having 2 files that sha1 the same would be very useful ;)
20:53 🔗 joeyh suggests they find something that sha1s to 4b825dc642cb6eb9a060e54bf8d69288fbee4904
20:53 🔗 joeyh that's the git empty tree hash
20:54 🔗 joeyh so, break git: $2.77M
20:54 🔗 joeyh backup IA: $???
20:54 🔗 fenn $400k in 1.5TB tapes
20:54 🔗 SketchCow https://twitter.com/danieldrucker/status/573948564608577537
20:57 🔗 yipdw good to know they're pulling out the money card
20:57 🔗 joeyh I see it right there next to their asshole card
20:58 🔗 SketchCow http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK&curid=6055&diff=22174&oldid=22172
20:58 🔗 SketchCow I actually know this technique.
20:58 🔗 SketchCow It's a technique used by alpha PUAs
20:58 🔗 SketchCow Works in sales
20:59 🔗 SketchCow Why would you walk away from _______
20:59 🔗 yipdw !ao https://twitter.com/danieldrucker/status/573880074732191744
20:59 🔗 yipdw oops
21:00 🔗 SketchCow I assume he's going to propose:
21:00 🔗 SketchCow - Working with someone who has a tape drive somewhere, and put IA on those tapes.
21:00 🔗 SketchCow And not propose:
21:00 🔗 SketchCow - Sending a free tape drive to the archive, and tapes
21:02 🔗 SketchCow Got the mail.
21:02 🔗 SketchCow It's the first.
21:02 🔗 SketchCow Sorry, not alpha PUA.
21:02 🔗 SketchCow My apologies.
21:02 🔗 SketchCow Academic.
21:03 🔗 SketchCow Looks the same if you squint.
21:03 🔗 SketchCow Somewhere in your network of contacts there has to be either someone at Oracle, or someone at a large computing center, who could donate the T10000D drives for your temporary use.
21:03 🔗 SketchCow That's his "helping you to get access to several hundred thousand dollars of resources"
21:04 🔗 SketchCow Telling me I should ask around for several hundred thousand dollars of resources.
21:06 🔗 garyrh I have a spare $200k in the back of my Tesla.
21:07 🔗 SketchCow Oh, wait, he's making calls.
21:10 🔗 yipdw garyrh: true cool cats keep their $200k in the frunk
21:13 🔗 garyrh http://i.imgflip.com/dk9r.gif
21:26 🔗 SketchCow OK, I've put him over to the admins
21:26 🔗 SketchCow the IDEA is fine.
21:26 🔗 SketchCow The APPROACH is also fine
21:39 🔗 SketchCow Oh thank god DFJustin fixed the animation
21:40 🔗 SketchCow 08:47, 6 March 2015 CRITICAL NEED TO USE TAPE, FULL EXPLANATION
21:40 🔗 SketchCow 16:18, 6 March 2015 burning animation fixed
21:41 🔗 DFJustin :D
21:43 🔗 SketchCow I'm just a little sensitive
21:43 🔗 SketchCow To "why dick around with [solution a] when [solution b] is staring you in the face"
21:43 🔗 SketchCow Also "I asked for people to give shit for free" == "I am getting you shit for free"
21:44 🔗 SketchCow Anyway, so, tasks for Jason or someone else
21:44 🔗 SketchCow - Take what's been discussed in here, get on wiki
21:44 🔗 SketchCow - split up wiki pages into more wiki pages, this stuff's getting large
21:45 🔗 SketchCow - continue work on census
21:45 🔗 SketchCow - get real numbers from census
21:48 🔗 SketchCow http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK now has appropriate gif
21:59 🔗 DFJustin you might wanna have an asterisk that it's actually in two places, for the literal crowd
22:11 🔗 SketchCow We're down to 13,075,201 items.
22:11 🔗 SketchCow (In the census)
23:15 🔗 Start (~Start@[redacted]) has joined #internetarchive.bak
23:16 🔗 svchfoo2 gives channel operator status to Start
23:34 🔗 zottelbey has quit (Remote host closed the connection)
23:51 🔗 edward_ has quit (Ping timeout: 512 seconds)
