00:27 -- enkiv2 has quit (Ping timeout: 606 seconds)
00:38 -- Start (~Start@[redacted]) has joined #internetarchive.bak
00:39 -- svchfoo1 gives channel operator status to Start
00:54 -- enkiv2 (~john@[redacted]) has joined #internetarchive.bak
01:56 -- jake1 has quit (Read error: Operation timed out)
02:00 -- Start has quit (Read error: Connection reset by peer)
02:00 -- Start_ (~Start@[redacted]) has joined #internetarchive.bak
02:00 -- Start_ is now known as Start
02:01 -- svchfoo1 gives channel operator status to Start
02:32 -- kaizoku (~kaizoku@[redacted]) has joined #internetarchive.bak
02:47 -- DFJustin has quit (Ping timeout: 258 seconds)
02:47 -- DFJustin (~justin@[redacted]) has joined #internetarchive.bak
02:50 -- DFJustin has quit (Client Quit)
02:50 -- DopefishJ (DopefishJu@[redacted]) has joined #internetarchive.bak
02:50 -- yhager has quit (Ping timeout: 258 seconds)
02:50 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
02:50 -- DopefishJ is now known as DFJustin
02:50 -- svchfoo2 gives channel operator status to DFJustin
02:55 -- GauntletW has quit (Read error: Operation timed out)
02:56 -- GauntletW (~ted@[redacted]) has joined #internetarchive.bak
02:57 -- yhager has quit (Ping timeout: 258 seconds)
02:57 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:01 -- GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:01 -- yhager has quit (hub.efnet.us irc.Prison.NET)
03:07 -- GauntletW (~ted@[redacted]) has joined #internetarchive.bak
03:07 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:18 -- GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:18 -- yhager has quit (hub.efnet.us irc.Prison.NET)
03:59 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
05:05 * joeyh runs stats on prelinger
05:12 <joeyh> SketchCow: while I think torrents have a lot going for them in simplicity, git-annex (or ipfs) seems better to me at attracting users.
05:13 <joeyh> Most of the big collections have a manageable number of items in them (10-100k). And unlike torrents, the others allow adding new items
05:14 <joeyh> if you like GD, or computer mags, or whatever, getting new ones automatically is pretty rad
05:16 <joeyh> btw, doesn't the IA have download stations for drop-in visitors? I really must get out there physically one day
05:17 <DFJustin> bittorrent sync allows adding new files
05:24 <trs80> is bt sync open source though?
05:24 <Ctrl-S> AFAIK no
05:25 <trs80> http://syncthing.net/ might be an alternative that is
05:28 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
05:29 <joeyh> ipfs is rather similar to bittorrent sync, more decentralized, but many of the same technologies
05:42 <SketchCow> OK, so.
05:42 <SketchCow> The hackernews drop caused a lot of people to come at me.
05:42 <SketchCow> Some are taking it wayyyyyy too seriously, and some consider it to be official IA.
05:43 <SketchCow> Translation: We're heading along anyway, and I am favoring git-annex, but people will try to emotionally/logically blackmail us into other solutions.
05:46 <pikhq> Because random pet project is clearly superior.
05:51 <SketchCow> Well, THIS is the random pet project.
05:51 <SketchCow> It might become more, but for now, I want to work with joeyh on this and everyone else too.
05:51 <joeyh> SketchCow: I like the idea of just letting people implement demo systems handling one standard starter dataset, like prelinger, and evaluate
05:51 <joeyh> if more than 1 group wants to
05:51 <SketchCow> I want to progress with you, as we find The Problems
05:52 <SketchCow> And make sure the wiki shows The Problems
05:52 <SketchCow> And also to see if this reveals Problems within IA's own infrastructure
05:53 <SketchCow> So, first, we ALL agree. The Census.
05:53 <SketchCow> Gotta know what's being backed up.
05:54 <SketchCow> So, in that way, we know: 14,926,080 items.
05:54 <SketchCow> Different from the 24 million we had. That 14,926,080 is the number of items that are public and indexed and downloadable.
05:54 <joeyh> are we going to get a file count per item?
05:54 <SketchCow> Yes, he's building a massive list of everything.
05:55 <SketchCow> This already betrayed bugs and issues in his reporter, so he's taking a little bit of time.
05:55 <joeyh> that's an interesting delta btw :)
05:55 <SketchCow> So this is already paying dividends.
05:55 <SketchCow> You mean from 24 million?
05:55 <joeyh> yeah
05:55 <SketchCow> Well, some are not indexed. Some are dark, and some are system items.
05:55 <joeyh> is wayback machine data in this?
05:55 <SketchCow> Spam will be dark, for example, and we get a lot of spam.
05:57 <SketchCow> I don't know.
05:59 <xmc> and there are items which are visible but not downloadable, like the not-public-domain texts
05:59 <SketchCow> Alright, out of the 14,926,080 indexed items I dumped from the metadata table on 2015-03-04T20:53:57, I was able to successfully scan through 14,921,581 items (I'm still sorting out the issues with the remaining 4,508).
05:59 <SketchCow> Out of those 14 million or so items, all of the non-derivative files add up to 14,225,047,435,566,359 bytes.
05:59 <SketchCow> 14.23 petabytes.
05:59 <SketchCow> See? Nothing.
06:00 * SketchCow goes down to Best Buy
06:00 <xmc> i got a hundred bucks, should cover my share
06:00 * joeyh adds a new plan: wait 10 years and go to best buy
06:06 <SketchCow> So, Jake tells me he is compressing the JSON collection of information on the files in the 14,920,000 items so that it can be downloaded and analyzed.
06:06 <joeyh> so, I'm running a simulation with git-annex and dummy data, 10k files, 100 clients, just to get some real numbers about how big the git repo grows when git-annex is tracking all those clients' activity
06:07 -- S[h]O[r]T has quit (Read error: Operation timed out)
06:08 <SketchCow> He verifies it takes about 10 hours to generate The List.
06:12 <joeyh> looks like the git repo will be 17mb after all 100 clients download a random ~300 files each and report back about the files they have
06:21 <joeyh> let's see how much it will grow if the clients all report back once a month to confirm they still have data..
06:22 <joeyh> 1 mb per month!
06:22 <joeyh> or less, I didn't get exact numbers. but, that's great news
06:22 <joeyh> yay for git's delta compression, it's so awesome
06:23 <SketchCow> joeyh: What's a good e-mail address for you?
06:23 <SketchCow> Or mail jscott@archive.org if you don't want these maniacs having it
06:24 <joeyh> id@joeyh.name
06:24 <joeyh> so, we can run for years, with clients reporting back every month, and get a git repo under 100 mb
06:24 <joeyh> and it will hold the full history of where every file was on every client, every month
06:25 <joeyh> we can probably handle repos with 10x as many files, given these numbers..
06:26 <joeyh> or, scale to 1000 clients
06:26 <joeyh> per shard, that is
06:26 <xmc> cooool
06:27 <joeyh> with thousands of shards, we could have a million+ different drives involved in this, and it seems it would scale ok, at least as far as the tracking overhead
06:29 <joeyh> (also, "git-annex forget" can drop the old historical data, if it did become too large)
06:30
🔗
|
|
joeyh will write up a script to do this simulation reproducible, but for now, bottom of http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation |
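[Editor's note: joeyh links his actual growth-test script a bit further down in this log; purely as an illustration of the approach he describes, here is a minimal sketch of that kind of simulation. This is not joeyh's script, and all names and the scaled-down numbers are made up.]

    #!/bin/sh
    # Sketch: one hub repo of dummy files, N simulated clients that each
    # "get" a random subset and sync their location tracking back.
    set -e
    NUMFILES=1000    # scaled down from the 10k in the real test
    NUMCLIENTS=10    # scaled down from 100

    git init hub
    (cd hub && git annex init hub &&
     for i in $(seq 1 $NUMFILES); do echo "dummy $i" > "file$i"; done &&
     git annex add . && git commit -m 'add dummy files')

    for c in $(seq 1 $NUMCLIENTS); do
        git clone hub "client$c"
        (cd "client$c" && git annex init "client$c" &&
         git ls-files | shuf | head -30 | xargs -r git annex get &&
         git annex sync)
    done
    du -sh hub/.git    # how big did the tracking data get?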
06:39 -- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
06:45 -- bzc6p has quit (Ping timeout: 600 seconds)
06:49 -- edward_ (~edward@[redacted]) has joined #internetarchive.bak
06:54 -- db48x (~user@[redacted]) has joined #internetarchive.bak
06:54 -- svchfoo1 gives channel operator status to db48x
06:57 -- tychotith (~tychotith@[redacted]) has joined #internetarchive.bak
07:09 -- db48x has quit (Read error: Connection reset by peer)
07:10 -- db48x2 (~user@[redacted]) has joined #internetarchive.bak
07:11 -- db48x2 is now known as db48x-the
07:11 -- db48x-the is now known as db48x2
07:15 <SketchCow> http://mamedev.org/downloader.php?file=releases/mame01.zip
07:15 <SketchCow> Wait
07:15 <SketchCow> https://archive.org/details/ia-bak-census_20150304
07:15 <SketchCow> joeyh: There you go
07:15 <joeyh> nice
07:16 <joeyh> but someone else will need to work on census stuff, I'm off to write a roguelike in 7 days
07:16 <joeyh> 24x7 coding babyee
07:17 -- db48x (~user@[redacted]) has joined #internetarchive.bak
07:19 -- svchfoo2 gives channel operator status to db48x
07:19 -- db48x2 has quit (Quit: brb)
07:20 -- db48x has quit (Quit: ERC Version 5.3 (IRC client for Emacs))
07:20 -- db48x (~user@[redacted]) has joined #internetarchive.bak
07:21 <joeyh> wow 8 gb of json
07:22 <ersi> wow 8gb of jason
07:22 -- svchfoo1 gives channel operator status to db48x
07:35 -- svchfoo2 gives channel operator status to db48x
07:35 * joeyh wants to know how many total filenames are listed in that json
07:39 <db48x> jsawk?
07:43 <joeyh> here's the script I'm using to simulate using git-annex at scale http://tmp.kitenet.net/git-annex-growth-test.sh
07:43 <db48x> neat
07:43 <db48x> how long does that take to run?
07:49 <joeyh> an hour or so
07:50 -- espes___ (~espes@[redacted]) has joined #internetarchive.bak
07:52 <espes___> *KNEE-JERK SKEPTICISM*
07:53 <db48x> what's the growth look like?
07:55 <db48x> oh, you put it in the wiki :)
07:56 <db48x> amazing how this looks doable, but Valhalla didn't
07:56
🔗
|
espes___ |
but I will just point out, that 20PB in 1 year is a quater of IA's network capacity continuously :P |
07:57
🔗
|
joeyh |
course we have no idea if enough people will join or how long to get enough |
08:11
🔗
|
db48x |
is there an easy way to check which version of git annex I have installed? |
08:12
🔗
|
joeyh |
git annex version |
08:12
🔗
|
joeyh |
and that script needs a fairly new one, btw |
08:12
🔗
|
db48x |
ah |
08:14
🔗
|
db48x |
I was doing git annex --version |
08:53
🔗
|
|
midas1 is now known as midas |
09:45
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
09:55
🔗
|
|
bzc6p_ is now known as bzc6p |
10:31
🔗
|
|
edward_ has quit (Ping timeout: 512 seconds) |
11:16
🔗
|
|
edward_ (~edward@[redacted]) has joined #internetarchive.bak |
12:16
🔗
|
|
edward_ has quit (Ping timeout: 512 seconds) |
12:22
🔗
|
|
S[h]O[r]T (~ShOrT@[redacted]) has joined #internetarchive.bak |
12:26
🔗
|
|
jake1 has quit (Quit: Leaving.) |
12:39
🔗
|
|
S[h]O[r]T has quit () |
12:57
🔗
|
|
edward_ (~edward@[redacted]) has joined #internetarchive.bak |
13:22
🔗
|
nicoo |
joeyh: You are git-annex's dev, right? |
13:23
🔗
|
nicoo |
I was wondering how realistic it would be to try to handle all of IA over git-annex, given that last time I used it, I had noticeable trouble scaling to hundreds of GB |
13:54
🔗
|
|
VADemon (~VADemon@[redacted]) has joined #internetarchive.bak |
13:57 <midas> i'm kinda worried about how we handle bitflips from the storage point of view; with a couple of hundred nodes holding TBs of data, at some point one drive is going to flip a bit. how will we checksum this?
13:59 <midas> also, will it be stored in containers or dumps of readable data?
14:17 -- edward_ has quit (Ping timeout: 512 seconds)
14:40 <joeyh> nicoo: if you had difficulty scaling to hundreds of GB, I'd suspect you had many small files.
14:40 -- yhager has quit (Read error: Connection reset by peer)
14:41 <joeyh> I have personal git-annex repos that are > 10 tb, and the only limit on scaling with large files is total number of files
14:41 <joeyh> and total amount of disk space
14:43 <joeyh> midas: it needs to checksum everything every so often
14:44 <joeyh> checksumming 500gb takes a while, anyone want to run the numbers for different likely types of storage and cpus?
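[Editor's note: a back-of-envelope answer to joeyh's question, assuming the disk rather than the CPU is the bottleneck (md5/sha on a single 2015-era core runs at a few hundred MB/s, faster than most spinning disks read):]

    # 500 GB at ~120 MB/s sequential (typical 7200rpm SATA): ~70 minutes
    echo $((500000 / 120 / 60)) minutes
    # the same 500 GB on a ~30 MB/s USB2 external: ~4.6 hours
    echo $((500000 / 30 / 3600)) hours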
14:44 <nicoo> joeyh: Lots of FLAC files, sized in the tens of MB. It was a while ago, though, so there might have been improvements
14:45 <joeyh> note, we can have clients periodically announce the files they think they still have. They could announce every month, even if it took them a year to checksum
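[Editor's note: git-annex's fsck can spread that checksumming out; a sketch of what a client's periodic cron job might look like, assuming a git-annex recent enough to have these fsck options:]

    # checksum for at most an hour, picking up where the last run stopped,
    # and restart the whole cycle every 30 days:
    git annex fsck --incremental-schedule=30d --time-limit=1h
    # then push the updated location-tracking info back:
    git annex sync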
14:45 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
14:47 <joeyh> so we can detect clients that drop out reasonably quickly, and rarer bit flips less quickly
14:49 <nicoo> joeyh: The checksumming shouldn't be mandatory for the client, though. For instance, I operate ZFS pools and run scrubs regularly (basically, checking the block-level checksums), so I know my data wasn't subjected to bit rot
14:51 <joeyh> sure, checksumming is just a way for a client to be sure it knows what it knows. If it has other ways to know, it can just tell us
14:51 * nicoo nods
14:51 <midas> oh and the most important thing
14:53 <midas> we need a leaderboard
14:53 <bzc6p> if we suppose that bad-acting is (kind of) excluded based on trust or any other method that has been discussed here recently,
14:53 <bzc6p> we could just ask (or build in) regular checksumming.
14:53 <bzc6p> Maybe a not-too-strong one is enough, isn't it? (As it's not against bad acting, but for checking integrity.)
14:54 <joeyh> yeah, that's what my git-annex design calls for, and it doesn't have proof of storage
14:54 <bzc6p> So we could choose a less computation-intensive one.
14:54 <bzc6p> OR
14:55 <bzc6p> maybe we could add some kind of ECC (error correction code)
14:55 <Ctrl-S> why not both?
14:55 <bzc6p> One is computation-intensive, the other uses more storage
14:57 <joeyh> adding ecc data would be good.. anyone know a tool that can do it alongside the original unmodified file though?
14:57 <bzc6p> There must be a lot.
14:57 <bzc6p> For example, for optical media there is dvdisaster
14:57 <bzc6p> The same method could be applied
15:01 <bzc6p> it's open source
15:06 <bzc6p> There must be several such tools, anyway.
15:08 * bzc6p thinks that he overestimated the number of such tools
15:11 <joeyh> well, find one that works well, and it can be added to the ingestion process, and could be used client-side to recover files that git-annex throws out due to checksum failure
15:12 -- edward_ (~edward@[redacted]) has joined #internetarchive.bak
15:13 <joeyh> btw, we need to decide which platforms clients run on
15:13 <joeyh> git-annex is linux/osx/windows.. I keep finding annoying bugs in the windows port though
15:13 <Ctrl-S> Windows 3.1
15:13 <Ctrl-S> Have to support older hardware
15:14 <Ctrl-S> :P
15:14 <Ctrl-S> Can you run it in a VM like the warrior?
15:15 <joeyh> a docker image seems more sensible.. because with docker, it can probably access the disk they want to use
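[Editor's note: no such image existed at this point; a hypothetical invocation, just to illustrate joeyh's point that docker can hand the donated disk to the client. The image name and the IABAK_SHARD variable are invented:]

    docker run -d \
      -v /mnt/bigdisk:/data \
      -e IABAK_SHARD=shard42 \
      archiveteam/iabak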
15:21 <joeyh> also, I think that OSX runs docker images in a linux emulator, which might make it easier. Dunno if windows can do the same yet
15:23 <sep332> docker is pretty linux-specific at this point
15:30 <joeyh> hmm, I heard OSX supported it
15:32 <sep332> looks like a wrapper around virtualbox http://docs.docker.com/installation/mac/
15:34 -- Start has quit (Disconnected.)
15:34 <bzc6p> As for ECC, I've found Parchive
15:34 <bzc6p> which is a "system"
15:35 <bzc6p> there is a linux commandline tool par2
15:36 <bzc6p> and
15:36 <bzc6p> several other tools for various operating systems
15:36 <bzc6p> (according to Wikipedia)
15:36 <bzc6p> I've played a bit with PyPar2 (linux)
15:37 <bzc6p> ECC generation, with the default settings (and 15% redundancy) seems to be a bit slow
15:37 <bzc6p> but there are several settings
15:39 <bzc6p> People here are much more expert than me; further investigation I leave up to you
15:39 <joeyh> nono, the way it works is you find something reasonable and you put it in the wiki
15:40 <joeyh> then the more expert person puts something better in :)
15:42 <bzc6p> I consider myself unworthy to put anything in the corresponding wiki page
15:42 * bzc6p sighs
15:46 <bzc6p> okay
15:52 <bzc6p> added
15:53 <ivan`> http://chuchusoft.com/par2_tbb/ is the optimized par2
15:53 <ivan`> I run it with 1%-5% depending on how big the input is
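[Editor's note: for the record, the basic par2cmdline workflow being discussed, with a placeholder filename; the recovery data sits alongside the original file, which stays unmodified:]

    # create 5% redundancy data next to the file:
    par2 create -r5 bigfile.warc.gz
    # later, check it, and repair from the .par2 volumes if blocks rotted:
    par2 verify bigfile.warc.gz.par2
    par2 repair bigfile.warc.gz.par2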
16:02 -- Sanqui has quit (Quit: .)
16:05 <sep332> par2 would let a client rebuild a small amount of damage locally, without having to re-fetch a whole 500GB block?
16:05 <SketchCow> Boop
16:05 <SketchCow> ha ha par
16:05 <SketchCow> PARRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
16:05 <SketchCow> Saves CD-ROM .isos, HD movies, and Internet Archive.
16:10 <sep332> reed-solomon is my homeboy
16:22 <SketchCow> par was always magical to me
16:23 <SketchCow> As for adding stuff to the wiki, like bzc6p, the whole POINT is for people to drop a bunch of stuff in there.
16:23 <SketchCow> I'm shooting down some stuff for the tests we're running, but other people can run tests and it's always good to have a nice set of information up there.
16:29 -- Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
16:35 <sep332> have we talked about deduplication? how does IA even handle that?
16:37 <SketchCow> It doesn't.
16:37 <SketchCow> https://twitter.com/danieldrucker/status/573884557860143104
16:37 <sep332> ok. i realize block-level would be crazy, but at least each file has a hash right?
16:39 <SketchCow> Still loving Par
16:39 <SketchCow> Anyway, so, first, I want to see this working prototype.
16:40 <SketchCow> And in doing so, we're going to discover all sorts of things.
16:40 <SketchCow> One thing is how certain things, like the Prelinger Archive, are not that big!
16:44 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
16:50 <xmc> so. scoreboarding.
16:51 <xmc> ( bytes * days retained ) / bandwidth used
16:51 <joeyh> gunzipping this json and it's already 32 gb.. I wonder how large it will be
16:51 <xmc> properly reward for not having to redownload
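[Editor's note: a worked reading of xmc's metric, with invented numbers: a client that downloads 4 TB once and holds it for 30 days scores (4e12 bytes * 30 days) / 4e12 bytes downloaded = 30; if flaky storage forces a full re-download, the denominator doubles and the score halves, so retention is rewarded over churn.]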
16:52 <xmc> zcat|grep :P
16:52 <WubTheCap> SketchCow: Query
16:53 <joeyh> aha, only 34 gb
16:53 * joeyh starts a stupid grep to count files w/o actually parsing the json properly
16:53 <xmc> i love processing structured text with unix tools
16:54 <xmc> xml? 'split' and i got this.
16:56 <joeyh> that'll take half an hour, according to pv
16:56 <SketchCow> jake1 is my co-worker, by the way. He's written all the ia interaction tools, including the python internetarchive
17:01 <SketchCow> WubTheCap: what
17:23 <DFJustin> par is fucking sorcery
17:27 -- yhager has quit (Read error: Connection reset by peer)
17:28 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
17:38 -- Start (~Start@[redacted]) has joined #internetarchive.bak
17:46 -- jake1 has quit (Quit: Leaving.)
17:49 -- Start has quit (Disconnected.)
17:52 <espes___> joeyh: `jq`!
17:54 <sep332> espes___: nice!
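[Editor's note: a sketch of the jq approach, assuming — the schema is not confirmed anywhere in this log — that the census dump is a stream of per-item JSON objects each carrying a "files" array:]

    zcat census.json.gz | jq '.files | length' | awk '{n += $1} END {print n}'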
17:55 -- Start (~Start@[redacted]) has joined #internetarchive.bak
17:56 <joeyh> IA: 271694965 files
17:56 <joeyh> so, that's good news
18:11 <joeyh> only 1 order of magnitude more files than items
18:11 <SketchCow> So, one minor note. The census file includes stream_only files
18:12 <SketchCow> Which means they can't be downloaded. So I just darked the item for jake to fix.
18:12 <SketchCow> See, it's all these little bumps that should be accounted for.
18:40 <SketchCow> 271 million original files, joeyh?
18:42 -- underscor (~quassel@[redacted]) has joined #internetarchive.bak
18:50 -- Start has quit (Disconnected.)
18:52 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
18:57 -- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
18:57 <joeyh> SketchCow: yes, original files, that's all the census lists, IIRC
18:58 -- bzc6p has quit (Ping timeout: 600 seconds)
18:58 -- sep332 has quit (Read error: Connection reset by peer)
19:08 <joeyh> hmm, in my experience, stream_only files can be downloaded, if you know how
19:16 -- zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
19:16 <db48x> yes, you just need to know the url
19:17 -- jake2 (~Adium@[redacted]) has joined #internetarchive.bak
19:24 -- jake1 has quit (Read error: Operation timed out)
19:25 <DFJustin> another wrinkle that will come up at some point, if you upload a torrent to IA, it downloads the files from the torrent but it marks them as derivatives and the .torrent is the only "original" file
19:25 <fenn> in what world does that make any sense?
19:26 <DFJustin> the whole thing is kind of hacked together
19:31 <fenn> processes generating derivative files should never involve network activity
19:32 -- WubTheCap has quit (Quit: Leaving)
19:35 <joeyh> ouch
19:37 <xmc> yeah. that is a big big wart.
19:43 <db48x> heh
19:44 <yipdw> is that really a big wart? sounds like you can fix that by downloading derivatives for torrents
19:44 <yipdw> I mean sure a special case, but whatever
19:45 <joeyh> yeah, true.
19:45 <DFJustin> well it would be nice to have it changed on the IA side eventually because it also stops them from deriving audio/video/etc files from the torrent
19:45 <yipdw> oh yeah definitely
19:45 <yipdw> just insofar as backup goes
19:45 <joeyh> so, with 271 files, if we wanted to not tar up an Item's files, that would mean increasing the git repo shard size from 10k to 100k, or the number of shards to 24000
19:45 <SketchCow> Boop.
19:46 <garyrh> Doesn't seem to be that many: https://archive.org/search.php?query=source%3A%28torrent%29
19:46 <SketchCow> So, Jake is redoing the census. The numbers will shrink.
19:46 <joeyh> somehow, 24000 git repos seems harder to deal with than 2400 of them
19:46 <SketchCow> A bit.
19:46 <joeyh> er, that's 271 million files of course. 271 would be slightly easier
19:47 <db48x> :)
19:47 <joeyh> not tarring up an Item's files has some nice features. like, git-annex could be told the regular IA url for the file, and would download it straight from the IA over http
19:47 <joeyh> rather than needing to keep the content temporarily on a ssh server
19:48 <joeyh> (makes the git repo a bit bigger of course, but probably less than you'd think thanks to delta compression)
19:49 <joeyh> 100 thousand files per git repo is manageable, it's just getting sorta close to the unmanageable zone
19:49 <db48x> putting 100k items in a single directory would be annoying
19:49 <joeyh> well, think $repo/$item/$file
19:49 <joeyh> or $repo/$collection/$item/$file
19:50 <db48x> better
19:50 <joeyh> on balance, I'm inclined toward 100k items in the repo, 2400 repos, and http download right from IA
19:50 <joeyh> or, 4800 repos of 50k each
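[Editor's note: the arithmetic behind those options: 271,694,965 files at 100k files per repo is about 2,717 repos, and at 50k per repo about 5,434; joeyh's 2400/4800 are the same calculation in round numbers.]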
19:51 <joeyh> so, I think the next step is to build a list of the url to every file listed in their census
19:52 <SketchCow> So, this drill has been VERY helpful for us all about the Census.
19:52 <SketchCow> We've ripped out a bunch of items for this class.
19:55 <joeyh> (along with the item and collection that the url is part of)
19:56 <joeyh> oh yeah, their json has md5 for the files
19:56 <SketchCow> Another set just went out
19:56 <joeyh> not the greatest checksum. git-annex can use it, but bad actors could always find a checksum collision and use it to overwrite files
19:57 <joeyh> but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:58 <joeyh> got cut off, what was the last thing I said?
19:59 <garyrh> <joeyh> but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:59 <joeyh> yeah, that's all
20:25 <joeyh> SketchCow: maybe ask your guys if there's any chance they could add a sha256 of every file. Eventually..
20:26 * joeyh notes there are ways to read a file and generate a md5 and sha256 at the same time. So if they periodically check the md5s, they could get the shas almost for free
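[Editor's note: one way to do the single-read double-digest trick joeyh describes, using bash process substitution; "bigfile" is a placeholder:]

    tee < bigfile >(md5sum) >(sha256sum) > /dev/null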
20:26 <joeyh> they = the IA
20:38 <SketchCow> joeyh: I flung it at them
20:38 <SketchCow> I'll keep restating, but let's barrel forward with the flaws
20:38 <SketchCow> And then note the flaws and see if they can be fixed
20:42 <joeyh> so, I can write a script that takes a file with lines like "<bytes> <checksum> <collection> <item> <file> <url>" and spits out, quite quickly, a git-annex repository
20:43 <joeyh> totally out of time as far as generating such files for now though
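[Editor's note: a sketch of the script joeyh describes, not his implementation. It leans on git-annex's fromkey and registerurl plumbing (registerurl may postdate this log); the input format is the one he names, the sha1-based key layout and everything else is assumed, and whitespace in filenames would break the naive read:]

    #!/bin/bash
    # input lines: <bytes> <checksum> <collection> <item> <file> <url>
    set -e
    git init census-shard && cd census-shard && git annex init shard
    while read -r bytes checksum collection item file url <&3; do
        key="SHA1-s${bytes}--${checksum}"   # git-annex key for a sha1 backend
        mkdir -p "$collection/$item"
        # create the annexed file without having its content locally:
        git annex fromkey --force "$key" "$collection/$item/$file"
        # teach the web special remote where the content lives:
        git annex registerurl "$key" "$url"
    done 3< census.txt
    git commit -m 'import census shard'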
20:43 <SketchCow> I just keep repeating myself, just because I don't like to see things like this blow up because people go "it's not perfect"
20:43 <joeyh> 7drl awaits
20:43 <SketchCow> 7drl?
20:43 <joeyh> writing a roguelike in 7 days
20:43 <SketchCow> From Hank:
20:43 <SketchCow> all files do already have sha1's, which are less collision-prone than md5's. would that be adequate?
20:44 <joeyh> it would be less inadequate
20:44 <joeyh> much less
20:44 <joeyh> :)
20:44 <joeyh> so yes plz, sha1s
20:44 <SketchCow> Would you consider the issue closed, and "make it 256 some time in the future"
20:44 <SketchCow> Well, they're there.
20:44 <SketchCow> They've been there, they're there.
20:45 <joeyh> I think so. practical sha1 attacks have not yet been demonstrated
20:45 <joeyh> also, if users can break sha1, they can break **git**
20:45 <joeyh> if they break git, and we're using git-annex, we have bigger problems
20:45 <SketchCow> Well, then, switch to the sha1
20:45 <joeyh> also, we should switch to archiving github, if sha1 is broken, before it melts down into a puddle
20:46 <SketchCow> So you're out for the count for a week?
20:46 <joeyh> yep, 8 to 16 hour days writing a game
20:46 <joeyh> and then I'm in boston for a week, but a little more available
20:47 <SketchCow> https://www.schneier.com/blog/archives/2005/02/sha1_broken.html
20:47 <SketchCow> :)
20:47 <SketchCow> Anyway, tracey says use md5sum AND sha-1
20:47 <joeyh> yes, but ... no.
20:48 <joeyh> that paper, afaik, has never been published, or the results have not been replicated
20:48 <joeyh> or, it wasn't good enough for practical collisions yet, just a reduction from "impossibly hard" to "convert the sun to rackmount computers" hard
20:49 <joeyh> A 2011 attack by Marc Stevens can produce hash collisions with a complexity between 2^60.3 and 2^65.3 operations.[1] No actual collisions have yet been produced. -- wiki
20:49 <joeyh> that's meant to be 2^60
20:51 <joeyh> so, it was 2^69 in 2005, and 2^60 in 2011.. we can see where this is going
20:51 <SketchCow> Well, understood, joeyh - of course we're going to keep doing the project but your bit will have to wait until you're back
20:51 <joeyh> "estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers"
20:51 <joeyh> man, I so want someone to do that
20:52 <yipdw> $5 million if you use AWS
20:53 <joeyh> having 2 files that sha1 the same would be very useful ;)
20:53 * joeyh suggests they find something that sha1s to 4b825dc642cb6eb9a060e54bf8d69288fbee4904
20:54 <joeyh> that's the git empty tree hash
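[Editor's note: that hash is easy to verify; it is what git assigns to a tree with no entries:]

    git hash-object -t tree /dev/null
    # 4b825dc642cb6eb9a060e54bf8d69288fbee4904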
20:54 <joeyh> so, break git: $2.77M
20:54 <joeyh> backup IA: $???
20:54 <fenn> $400k in 1.5TB tapes
20:57 <SketchCow> https://twitter.com/danieldrucker/status/573948564608577537
20:57 <yipdw> good to know they're pulling out the money card
20:58 <joeyh> I see it right there next to their asshole card
20:58 <SketchCow> http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK&curid=6055&diff=22174&oldid=22172
20:58 <SketchCow> I actually know this technique.
20:58 <SketchCow> It's a technique used by alpha PUAs
20:59 <SketchCow> Works in sales
20:59 <SketchCow> Why would you walk away from _______
20:59 <yipdw> !ao https://twitter.com/danieldrucker/status/573880074732191744
21:00 <yipdw> oops
21:00 <SketchCow> I assume he's going to propose:
21:00 <SketchCow> - Working with someone who has a tape drive somewhere, and put IA on those tapes.
21:00 <SketchCow> And not propose:
21:02 <SketchCow> - Sending a free tape drive to the archive, and tapes
21:02 <SketchCow> Got the mail.
21:02 <SketchCow> It's the first.
21:02 <SketchCow> Sorry, not alpha PUA.
21:02 <SketchCow> My apologies.
21:03 <SketchCow> Academic.
21:03 <SketchCow> Looks the same if you squint.
21:03 <SketchCow> Somewhere in your network of contacts there has to be either someone at Oracle, or someone at a large computing center, who could donate the T10000D drives for your temporary use.
21:04 <SketchCow> That's his "helping you to get access to several hundred thousand dollars of resources"
21:06 <SketchCow> Telling me I should ask around for several hundred thousand dollars of resources.
21:07 <garyrh> I have a spare $200k in the back of my Tesla.
21:10 <SketchCow> Oh, wait, he's making calls.
21:13 <yipdw> garyrh: true cool cats keep their $200k in the frunk
21:26 <garyrh> http://i.imgflip.com/dk9r.gif
21:26 <SketchCow> OK, I've put him over to the admins
21:26 <SketchCow> the IDEA is fine.
21:39 <SketchCow> The APPROACH is also fine
21:40 <SketchCow> Oh thank god DFJustin fixed the animation
21:40 <SketchCow> 08:47, 6 March 2015 CRITICAL NEED TO USE TAPE, FULL EXPLANATION
21:41 <SketchCow> 16:18, 6 March 2015 burning animation fixed
21:43 <DFJustin> :D
21:43 <SketchCow> I'm just a little sensitive
21:43 <SketchCow> To "why dick around with [solution a] when [solution b] is staring you in the face"
21:44 <SketchCow> Also "I asked for people to give shit for free" == "I am getting you shit for free"
21:44 <SketchCow> Anyway, so, tasks for Jason or someone else
21:44 <SketchCow> - Take what's been discussed in here, get on wiki
21:45 <SketchCow> - split up wiki pages into more wiki pages, this stuff's getting large
21:45 <SketchCow> - continue work on census
21:48 <SketchCow> - get real numbers from census
21:59 <SketchCow> http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK now has appropriate gif
22:11 <DFJustin> you might wanna have an asterisk that it's actually in two places, for the literal crowd
22:11 <SketchCow> We're down to 13,075,201 items.
22:11 <SketchCow> (In the census)
23:15 -- Start (~Start@[redacted]) has joined #internetarchive.bak
23:16 -- svchfoo2 gives channel operator status to Start
23:34 -- zottelbey has quit (Remote host closed the connection)
23:51 -- edward_ has quit (Ping timeout: 512 seconds)