00:27 -- enkiv2 has quit (Ping timeout: 606 seconds)
00:38 -- Start (~Start@[redacted]) has joined #internetarchive.bak
00:39 -- svchfoo1 gives channel operator status to Start
00:54 -- enkiv2 (~john@[redacted]) has joined #internetarchive.bak
01:56 -- jake1 has quit (Read error: Operation timed out)
02:00 -- Start has quit (Read error: Connection reset by peer)
02:00 -- Start_ (~Start@[redacted]) has joined #internetarchive.bak
02:00 -- Start_ is now known as Start
02:01 -- svchfoo1 gives channel operator status to Start
02:32 -- kaizoku (~kaizoku@[redacted]) has joined #internetarchive.bak
02:47 -- DFJustin has quit (Ping timeout: 258 seconds)
02:47 -- DFJustin (~justin@[redacted]) has joined #internetarchive.bak
02:50 -- DFJustin has quit (Client Quit)
02:50 -- DopefishJ (DopefishJu@[redacted]) has joined #internetarchive.bak
02:50 -- yhager has quit (Ping timeout: 258 seconds)
02:50 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
02:50 -- DopefishJ is now known as DFJustin
02:50 -- svchfoo2 gives channel operator status to DFJustin
02:55 -- GauntletW has quit (Read error: Operation timed out)
02:56 -- GauntletW (~ted@[redacted]) has joined #internetarchive.bak
02:57 -- yhager has quit (Ping timeout: 258 seconds)
02:57 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:01 -- GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:01 -- yhager has quit (hub.efnet.us irc.Prison.NET)
03:07 -- GauntletW (~ted@[redacted]) has joined #internetarchive.bak
03:07 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
03:18 -- GauntletW has quit (hub.efnet.us irc.Prison.NET)
03:18 -- yhager has quit (hub.efnet.us irc.Prison.NET)
03:59 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
05:05 * joeyh runs stats on prelinger
05:12 <joeyh> SketchCow: while I think torrents have a lot going for them in simplicity, git-annex (or ipfs) seems better to me at attracting users.
05:13 <joeyh> Most of the big collections have a manageable number of items in them (10-100k). And unlike torrents, the others allow adding new items
05:14 <joeyh> if you like GD, or computer mags, or whatever, getting new ones automatically is pretty rad
05:16 <joeyh> btw, doesn't the IA have download stations for drop-in visitors? I really must get out there physically one day
05:17 <DFJustin> bittorrent sync allows adding new files
05:24 <trs80> is bt sync open source though?
05:24 <Ctrl-S> AFAIK no
05:25 <trs80> http://syncthing.net/ might be an alternative that is
05:28 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
05:29 <joeyh> ipfs is rather similar to bittorrent sync, more decentralized, but many of the same technologies
05:42 <SketchCow> OK, so.
05:42 <SketchCow> The hackernews drop caused a lot of people to come at me.
05:42 <SketchCow> Some are taking it wayyyyyy too seriously, and some consider it to be official IA.
05:43 <SketchCow> Translation: We're heading along anyway, and I am favoring git-annex, but people will try to emotionally/logically blackmail us into other solutions.
05:46 <pikhq> Because random pet project is clearly superior.
05:51 <SketchCow> Well, THIS is the random pet project.
05:51 <SketchCow> It might become more, but for now, I want to work with joeyh on this and everyone else too.
05:51 <joeyh> SketchCow: I like the idea of just letting people implement demo systems handling one standard starter dataset, like prelinger, and evaluate
05:51 <joeyh> if more than 1 group wants to
05:51 <SketchCow> I want to progress with you, as we find The Problems
05:52 <SketchCow> And make sure the wiki shows The Problems
05:52 <SketchCow> And also to see if this reveals Problems within IA's own infrastructure
05:53 <SketchCow> So, first, we ALL agree. The Census.
05:53 <SketchCow> Gotta know what's being backed up.
05:54 <SketchCow> So, in that way, we know: 14,926,080 items.
05:54 <SketchCow> Different from the 24 million we had. That 14,926,080 is the number of items that are public and indexed and downloadable.
05:54 <joeyh> are we going to get a file count per item?
05:54 <SketchCow> Yes, he's building a massive list of everything.
05:55 <SketchCow> This already betrayed bugs and issues in his reporter, so he's taking a little bit of time.
05:55 <joeyh> that's an interesting delta btw :)
05:55 <SketchCow> So this is already paying dividends.
05:55 <SketchCow> You mean from 24 million?
05:55 <joeyh> yeah
05:55 <SketchCow> Well, some are not indexed. Some are dark, and some are system items.
05:55 <joeyh> is wayback machine data in this?
05:55 <SketchCow> Spam will be dark, for example, and we get a lot of spam.
05:57 <SketchCow> I don't know.
05:59 <xmc> and there are items which are visible but not downloadable, like the not-public-domain texts
05:59 <SketchCow> Alright, out of the 14,926,080 indexed items I dumped from the metadata table on 2015-03-04T20:53:57, I was able to successfully scan through 14,921,581 items (I'm still sorting out the issues with the remaining 4,508).
05:59 <SketchCow> Out of those 14 million or so items, all of the non-derivative files add up to 14,225,047,435,566,359 bytes.
05:59 <SketchCow> 14.23 petabytes.
05:59 <SketchCow> See? Nothing.
06:00 * SketchCow goes down to Best Buy
06:00 <xmc> i got a hundred bucks, should cover my share
06:00 * joeyh adds a new plan: wait 10 years and go to best buy
06:06 <SketchCow> So, Jake tells me he is compressing the JSON collection of information on the files in the 14,920,000 items so that it can be downloaded and analyzed.
06:06 <joeyh> so, I'm running a simulation with git-annex and dummy data, 10k files, 100 clients, just to get some real numbers about how big the git repo grows when git-annex is tracking all those clients' activity
06:07 -- S[h]O[r]T has quit (Read error: Operation timed out)
06:08 <SketchCow> He verifies it takes about 10 hours to generate The List.
06:12 <joeyh> looks like the git repo will be 17mb after all 100 clients download a random ~300 files each and report back about the files they have
06:21 <joeyh> let's see how much it will grow if the clients all report back once a month to confirm they still have data..
06:22 <joeyh> 1 mb per month!
06:22 <joeyh> or less, I didn't get exact numbers. but, that's great news
06:22 <joeyh> yay for git's delta compression, it's so awesome
06:23 <SketchCow> joeyh: What's a good e-mail address for you?
06:23 <SketchCow> Or mail jscott@archive.org if you don't want these maniacs having it
06:24 <joeyh> id@joeyh.name
06:24 <joeyh> so, we can run for years, with clients reporting back every month, and get a git repo under 100 mb
06:24 <joeyh> and it will hold the full history of where every file was on every client, every month
06:25 <joeyh> we can probably handle repos with 10x as many files, given these numbers..
06:26 <joeyh> or, scale to 1000 clients
06:26 <joeyh> per shard, that is
06:26 <xmc> cooool
06:27 <joeyh> with thousands of shards, we could have a million+ different drives involved in this, and it seems it would scale ok, at least as far as the tracking overhead
06:29 <joeyh> (also, "git-annex forget" can drop the old historical data, if it did become too large)
06:30
🔗
|
|
joeyh will write up a script to do this simulation reproducible, but for now, bottom of http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation |
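[Editor's note: joeyh links his actual growth-test script a bit further down in this log; purely as an illustration of the approach he describes, here is a minimal sketch of that kind of simulation. This is not joeyh's script, and all names and the scaled-down numbers are made up.]

    #!/bin/sh
    # Sketch: one hub repo of dummy files, N simulated clients that each
    # "get" a random subset and sync their location tracking back.
    set -e
    NUMFILES=1000    # scaled down from the 10k in the real test
    NUMCLIENTS=10    # scaled down from 100

    git init hub
    (cd hub && git annex init hub &&
     for i in $(seq 1 $NUMFILES); do echo "dummy $i" > "file$i"; done &&
     git annex add . && git commit -m 'add dummy files')

    for c in $(seq 1 $NUMCLIENTS); do
        git clone hub "client$c"
        (cd "client$c" && git annex init "client$c" &&
         git ls-files | shuf | head -30 | xargs -r git annex get &&
         git annex sync)
    done
    du -sh hub/.git    # how big did the tracking data get?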
06:39 -- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
06:45 -- bzc6p has quit (Ping timeout: 600 seconds)
06:49 -- edward_ (~edward@[redacted]) has joined #internetarchive.bak
06:54 -- db48x (~user@[redacted]) has joined #internetarchive.bak
06:54 -- svchfoo1 gives channel operator status to db48x
06:57 -- tychotith (~tychotith@[redacted]) has joined #internetarchive.bak
07:09 -- db48x has quit (Read error: Connection reset by peer)
07:10 -- db48x2 (~user@[redacted]) has joined #internetarchive.bak
07:11 -- db48x2 is now known as db48x-the
07:11 -- db48x-the is now known as db48x2
07:15 <SketchCow> http://mamedev.org/downloader.php?file=releases/mame01.zip
07:15 <SketchCow> Wait
07:15 <SketchCow> https://archive.org/details/ia-bak-census_20150304
07:15 <SketchCow> joeyh: There you go
07:15 <joeyh> nice
07:16 <joeyh> but someone else will need to work on census stuff, I'm off to write a roguelike in 7 days
07:16 <joeyh> 24x7 coding babyee
07:17 -- db48x (~user@[redacted]) has joined #internetarchive.bak
07:19 -- svchfoo2 gives channel operator status to db48x
07:19 -- db48x2 has quit (Quit: brb)
07:20 -- db48x has quit (Quit: ERC Version 5.3 (IRC client for Emacs))
07:20 -- db48x (~user@[redacted]) has joined #internetarchive.bak
07:21 <joeyh> wow 8 gb of json
07:22 <ersi> wow 8gb of jason
07:22 -- svchfoo1 gives channel operator status to db48x
07:35 -- svchfoo2 gives channel operator status to db48x
07:35 * joeyh wants to know how many total filenames are listed in that json
07:39 <db48x> jsawk?
07:43 <joeyh> here's the script I'm using to simulate using git-annex at scale http://tmp.kitenet.net/git-annex-growth-test.sh
07:43 <db48x> neat
07:43 <db48x> how long does that take to run?
07:49 <joeyh> an hour or so
07:50 -- espes___ (~espes@[redacted]) has joined #internetarchive.bak
07:52 <espes___> *KNEE-JERK SKEPTICISM*
07:53 <db48x> what's the growth look like?
07:55 <db48x> oh, you put it in the wiki :)
07:56 <db48x> amazing how this looks doable, but Valhalla didn't
07:56
🔗
|
espes___ |
but I will just point out, that 20PB in 1 year is a quater of IA's network capacity continuously :P |
07:57
🔗
|
joeyh |
course we have no idea if enough people will join or how long to get enough |
08:11
🔗
|
db48x |
is there an easy way to check which version of git annex I have installed? |
08:12
🔗
|
joeyh |
git annex version |
08:12
🔗
|
joeyh |
and that script needs a fairly new one, btw |
08:12
🔗
|
db48x |
ah |
08:14
🔗
|
db48x |
I was doing git annex --version |
08:53
🔗
|
|
midas1 is now known as midas |
09:45
🔗
|
|
X-Scale (~gbabios@[redacted]) has joined #internetarchive.bak |
09:55
🔗
|
|
bzc6p_ is now known as bzc6p |
10:31
🔗
|
|
edward_ has quit (Ping timeout: 512 seconds) |
11:16
🔗
|
|
edward_ (~edward@[redacted]) has joined #internetarchive.bak |
12:16
🔗
|
|
edward_ has quit (Ping timeout: 512 seconds) |
12:22
🔗
|
|
S[h]O[r]T (~ShOrT@[redacted]) has joined #internetarchive.bak |
12:26
🔗
|
|
jake1 has quit (Quit: Leaving.) |
12:39
🔗
|
|
S[h]O[r]T has quit () |
12:57
🔗
|
|
edward_ (~edward@[redacted]) has joined #internetarchive.bak |
13:22
🔗
|
nicoo |
joeyh: You are git-annex's dev, right? |
13:23
🔗
|
nicoo |
I was wondering how realistic it would be to try to handle all of IA over git-annex, given that last time I used it, I had noticeable trouble scaling to hundreds of GB |
13:54
🔗
|
|
VADemon (~VADemon@[redacted]) has joined #internetarchive.bak |
13:57 <midas> i'm kinda worried about how we handle bitflips from the storage point of view; with a couple of hundred nodes holding TBs of data, at some point one drive is going to flip a bit. how will we checksum this?
13:59 <midas> also, will it be stored in containers or dumps of readable data?
14:17 -- edward_ has quit (Ping timeout: 512 seconds)
14:40 <joeyh> nicoo: if you had difficulty scaling to hundreds of GB, I'd suspect you had many small files.
14:40 -- yhager has quit (Read error: Connection reset by peer)
14:41 <joeyh> I have personal git-annex repos that are > 10 tb, and the only limit on scaling with large files is total number of files
14:41 <joeyh> and total amount of disk space
14:43 <joeyh> midas: it needs to checksum everything every so often
14:44 <joeyh> checksumming 500gb takes a while, anyone want to run the numbers for different likely types of storage and cpus?
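[Editor's note: a back-of-envelope answer to joeyh's question, assuming the disk rather than the CPU is the bottleneck (md5/sha on a single 2015-era core runs at a few hundred MB/s, faster than most spinning disks read):]

    # 500 GB at ~120 MB/s sequential (typical 7200rpm SATA): ~70 minutes
    echo $((500000 / 120 / 60)) minutes
    # the same 500 GB on a ~30 MB/s USB2 external: ~4.6 hours
    echo $((500000 / 30 / 3600)) hours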
14:44 <nicoo> joeyh: Lots of FLAC files, sized in the tens of MB. It was a while ago, though, so there might have been improvements
14:45 <joeyh> note, we can have clients periodically announce the files they think they still have. They could announce every month, even if it took them a year to checksum
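[Editor's note: git-annex's fsck can spread that checksumming out; a sketch of what a client's periodic cron job might look like, assuming a git-annex recent enough to have these fsck options:]

    # checksum for at most an hour, picking up where the last run stopped,
    # and restart the whole cycle every 30 days:
    git annex fsck --incremental-schedule=30d --time-limit=1h
    # then push the updated location-tracking info back:
    git annex sync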
14:45 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
14:47 <joeyh> so we can detect clients that drop out reasonably quickly, and rarer bit flips less quickly
14:49 <nicoo> joeyh: The checksumming shouldn't be mandatory for the client, though. For instance, I operate ZFS pools and run scrubs regularly (basically, checking the block-level checksums), so I know my data wasn't subjected to bit rot
14:51 <joeyh> sure, checksumming is just a way for a client to be sure it knows what it knows. If it has other ways to know, it can just tell us
14:51 * nicoo nods
14:51 <midas> oh and the most important thing
14:53 <midas> we need a leaderboard
14:53 <bzc6p> if we suppose that bad-acting is (kind of) excluded based on trust or any other method that has been discussed here recently,
14:53 <bzc6p> we could just ask (or build in) regular checksumming.
14:53 <bzc6p> Maybe a not-too-strong one is enough, isn't it? (As it's not against bad acting, but for checking integrity.)
14:54 <joeyh> yeah, that's what my git-annex design calls for, and it doesn't have proof of storage
14:54 <bzc6p> So we could choose a less computation-intensive one.
14:54 <bzc6p> OR
14:55 <bzc6p> maybe we could add some kind of ECC (error correction code)
14:55 <Ctrl-S> why not both?
14:55 <bzc6p> One is computation-intensive, the other uses more storage
14:57 <joeyh> adding ecc data would be good.. anyone know a tool that can do it alongside the original unmodified file though?
14:57 <bzc6p> There must be a lot.
14:57 <bzc6p> For example, for optical media there is dvdisaster
14:57 <bzc6p> The same method could be applied
15:01 <bzc6p> it's open source
15:06 <bzc6p> There must be several such tools, anyway.
15:08 * bzc6p thinks that he overestimated the number of such tools
15:11 <joeyh> well, find one that works well, and it can be added to the ingestion process, and could be used client-side to recover files that git-annex throws out due to checksum failure
15:12 -- edward_ (~edward@[redacted]) has joined #internetarchive.bak
15:13 <joeyh> btw, we need to decide which platforms clients run on
15:13 <joeyh> git-annex is linux/osx/windows.. I keep finding annoying bugs in the windows port though
15:13 <Ctrl-S> Windows 3.1
15:13 <Ctrl-S> Have to support older hardware
15:14 <Ctrl-S> :P
15:14 <Ctrl-S> Can you run it in a VM like the warrior?
15:15 <joeyh> a docker image seems more sensible.. because with docker, it can probably access the disk they want to use
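[Editor's note: no such image existed at this point; a hypothetical invocation, just to illustrate joeyh's point that docker can hand the donated disk to the client. The image name and the IABAK_SHARD variable are invented:]

    docker run -d \
      -v /mnt/bigdisk:/data \
      -e IABAK_SHARD=shard42 \
      archiveteam/iabak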
15:21 <joeyh> also, I think that OSX runs docker images in a linux emulator, which might make it easier. Dunno if windows can do the same yet
15:23 <sep332> docker is pretty linux-specific at this point
15:30 <joeyh> hmm, I heard OSX supported it
15:32 <sep332> looks like a wrapper around virtualbox http://docs.docker.com/installation/mac/
15:34 -- Start has quit (Disconnected.)
15:34 <bzc6p> As for ECC, I've found Parchive
15:34 <bzc6p> which is a "system"
15:35 <bzc6p> there is a linux commandline tool par2
15:36 <bzc6p> and
15:36 <bzc6p> several other tools for various operating systems
15:36 <bzc6p> (according to Wikipedia)
15:36 <bzc6p> I've played a bit with PyPar2 (linux)
15:37 <bzc6p> ECC generation, with the default settings (and 15% redundancy) seems to be a bit slow
15:37 <bzc6p> but there are several settings
15:39 <bzc6p> People here are much more expert than me; further investigation I leave up to you
15:39 <joeyh> nono, the way it works is you find something reasonable and you put it in the wiki
15:40 <joeyh> then the more expert person puts something better in :)
15:42 <bzc6p> I consider myself unworthy to put anything in the corresponding wiki page
15:42 * bzc6p sighs
15:46 <bzc6p> okay
15:52 <bzc6p> added
15:53 <ivan`> http://chuchusoft.com/par2_tbb/ is the optimized par2
15:53 <ivan`> I run it with 1%-5% depending on how big the input is
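[Editor's note: for the record, the basic par2cmdline workflow being discussed, with a placeholder filename; the recovery data sits alongside the original file, which stays unmodified:]

    # create 5% redundancy data next to the file:
    par2 create -r5 bigfile.warc.gz
    # later, check it, and repair from the .par2 volumes if blocks rotted:
    par2 verify bigfile.warc.gz.par2
    par2 repair bigfile.warc.gz.par2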
16:02 -- Sanqui has quit (Quit: .)
16:05 <sep332> par2 would let a client rebuild a small amount of damage locally, without having to re-fetch a whole 500GB block?
16:05 <SketchCow> Boop
16:05 <SketchCow> ha ha par
16:05 <SketchCow> PARRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
16:05 <SketchCow> Saves CD-ROM .isos, HD movies, and Internet Archive.
16:10 <sep332> reed-solomon is my homeboy
16:22 <SketchCow> par was always magical to me
16:23 <SketchCow> As for adding stuff to the wiki, like bzc6p, the whole POINT is for people to drop a bunch of stuff in there.
16:23 <SketchCow> I'm shooting down some stuff for the tests we're running, but other people can run tests and it's always good to have a nice set of information up there.
16:29 -- Sanqui (~Sanky_R@[redacted]) has joined #internetarchive.bak
16:35 <sep332> have we talked about deduplication? how does IA even handle that?
16:37 <SketchCow> It doesn't.
16:37 <SketchCow> https://twitter.com/danieldrucker/status/573884557860143104
16:37 <sep332> ok. i realize block-level would be crazy, but at least each file has a hash right?
16:39 <SketchCow> Still loving Par
16:39 <SketchCow> Anyway, so, first, I want to see this working prototype.
16:40 <SketchCow> And in doing so, we're going to discover all sorts of things.
16:40 <SketchCow> One thing is how certain things, like the Prelinger Archive, are not that big!
16:44 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
16:50 <xmc> so. scoreboarding.
16:51 <xmc> ( bytes * days retained ) / bandwidth used
16:51 <joeyh> gunzipping this json and it's already 32 gb.. I wonder how large it will be
16:51 <xmc> properly reward for not having to redownload
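[Editor's note: a worked reading of xmc's metric, with invented numbers: a client that downloads 4 TB once and holds it for 30 days scores (4e12 bytes * 30 days) / 4e12 bytes downloaded = 30; if flaky storage forces a full re-download, the denominator doubles and the score halves, so retention is rewarded over churn.]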
16:52 <xmc> zcat|grep :P
16:52 <WubTheCap> SketchCow: Query
16:53 <joeyh> aha, only 34 gb
16:53 * joeyh starts a stupid grep to count files w/o actually parsing the json properly
16:53 <xmc> i love processing structured text with unix tools
16:54 <xmc> xml? 'split' and i got this.
16:56 <joeyh> that'll take half an hour, according to pv
16:56 <SketchCow> jake1 is my co-worker, by the way. He's written all the ia interaction tools, including the python internetarchive
17:01 <SketchCow> WubTheCap: what
17:23 <DFJustin> par is fucking sorcery
17:27 -- yhager has quit (Read error: Connection reset by peer)
17:28 -- yhager (~yuval@[redacted]) has joined #internetarchive.bak
17:38 -- Start (~Start@[redacted]) has joined #internetarchive.bak
17:46 -- jake1 has quit (Quit: Leaving.)
17:49 -- Start has quit (Disconnected.)
17:52 <espes___> joeyh: `jq`!
17:54 <sep332> espes___: nice!
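[Editor's note: a sketch of the jq approach, assuming — the schema is not confirmed anywhere in this log — that the census dump is a stream of per-item JSON objects each carrying a "files" array:]

    zcat census.json.gz | jq '.files | length' | awk '{n += $1} END {print n}'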
17:55 -- Start (~Start@[redacted]) has joined #internetarchive.bak
17:56 <joeyh> IA: 271694965 files
17:56 <joeyh> so, that's good news
18:11 <joeyh> only 1 order of magnitude more files than items
18:11 <SketchCow> So, one minor note. The census file includes stream_only files
18:12 <SketchCow> Which means they can't be downloaded. So I just darked the item for jake to fix.
18:12 <SketchCow> See, it's all these little bumps that should be accounted for.
18:40 <SketchCow> 271 million original files, joeyh?
18:42 -- underscor (~quassel@[redacted]) has joined #internetarchive.bak
18:50 -- Start has quit (Disconnected.)
18:52 -- jake1 (~Adium@[redacted]) has joined #internetarchive.bak
18:57 -- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
18:57 <joeyh> SketchCow: yes, original files, that's all the census lists, IIRC
18:58 -- bzc6p has quit (Ping timeout: 600 seconds)
18:58 -- sep332 has quit (Read error: Connection reset by peer)
19:08 <joeyh> hmm, in my experience, stream_only files can be downloaded, if you know how
19:16 -- zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
19:16 <db48x> yes, you just need to know the url
19:17 -- jake2 (~Adium@[redacted]) has joined #internetarchive.bak
19:24 -- jake1 has quit (Read error: Operation timed out)
19:25 <DFJustin> another wrinkle that will come up at some point, if you upload a torrent to IA, it downloads the files from the torrent but it marks them as derivatives and the .torrent is the only "original" file
19:25 <fenn> in what world does that make any sense?
19:26 <DFJustin> the whole thing is kind of hacked together
19:31 <fenn> processes generating derivative files should never involve network activity
19:32 -- WubTheCap has quit (Quit: Leaving)
19:35 <joeyh> ouch
19:37 <xmc> yeah. that is a big big wart.
19:43 <db48x> heh
19:44 <yipdw> is that really a big wart? sounds like you can fix that by downloading derivatives for torrents
19:44 <yipdw> I mean sure a special case, but whatever
19:45 <joeyh> yeah, true.
19:45 <DFJustin> well it would be nice to have it changed on the IA side eventually because it also stops them from deriving audio/video/etc files from the torrent
19:45 <yipdw> oh yeah definitely
19:45 <yipdw> just insofar as backup goes
19:45 <joeyh> so, with 271 files, if we wanted to not tar up an Item's files, that would mean increasing the git repo shard size from 10k to 100k, or the number of shards to 24000
19:45 <SketchCow> Boop.
19:46 <garyrh> Doesn't seem to be that many: https://archive.org/search.php?query=source%3A%28torrent%29
19:46 <SketchCow> So, Jake is redoing the census. The numbers will shrink.
19:46 <joeyh> somehow, 24000 git repos seems harder to deal with than 2400 of them
19:46 <SketchCow> A bit.
19:46 <joeyh> er, that's 271 million files of course. 271 would be slightly easier
19:47 <db48x> :)
19:47 <joeyh> not tarring up an Item's files has some nice features. like, git-annex could be told the regular IA url for the file, and would download it straight from the IA over http
19:47 <joeyh> rather than needing to keep the content temporarily on a ssh server
19:48 <joeyh> (makes the git repo a bit bigger of course, but probably less than you'd think thanks to delta compression)
19:49 <joeyh> 100 thousand files per git repo is manageable, it's just getting sorta close to the unmanageable zone
19:49 <db48x> putting 100k items in a single directory would be annoying
19:49 <joeyh> well, think $repo/$item/$file
19:49 <joeyh> or $repo/$collection/$item/$file
19:50 <db48x> better
19:50 <joeyh> on balance, I'm inclined toward 100k items in the repo, 2400 repos, and http download right from IA
19:50 <joeyh> or, 4800 repos of 50k each
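[Editor's note: the arithmetic behind those options: 271,694,965 files at 100k files per repo is about 2,717 repos, and at 50k per repo about 5,434; joeyh's 2400/4800 are the same calculation in round numbers.]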
19:51 <joeyh> so, I think the next step is to build a list of the url to every file listed in their census
19:52 <SketchCow> So, this drill has been VERY helpful for us all about the Census.
19:52 <SketchCow> We've ripped out a bunch of items for this class.
19:55 <joeyh> (along with the item and collection that the url is part of)
19:56 <joeyh> oh yeah, their json has md5 for the files
19:56 <SketchCow> Another set just went out
19:56 <joeyh> not the greatest checksum. git-annex can use it, but bad actors could always find a checksum collision and use it to overwrite files
19:57 <joeyh> but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:58 <joeyh> got cut off, what was the last thing I said?
19:59 <garyrh> <joeyh> but, if git-annex doesn't reuse that md5sum, we have to somehow sha256 all the files when generating the git-annex repo
19:59 <joeyh> yeah, that's all
20:25 <joeyh> SketchCow: maybe ask your guys if there's any chance they could add a sha256 of every file. Eventually..
20:26 * joeyh notes there are ways to read a file and generate a md5 and sha256 at the same time. So if they periodically check the md5s, they could get the shas almost for free
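[Editor's note: one way to do the single-read double-digest trick joeyh describes, using bash process substitution; "bigfile" is a placeholder:]

    tee < bigfile >(md5sum) >(sha256sum) > /dev/null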
20:26 <joeyh> they = the IA
20:38 <SketchCow> joeyh: I flung it at them
20:38 <SketchCow> I'll keep restating, but let's barrel forward with the flaws
20:38 <SketchCow> And then note the flaws and see if they can be fixed
20:42 <joeyh> so, I can write a script that takes a file with lines like "<bytes> <checksum> <collection> <item> <file> <url>" and spits out, quite quickly, a git-annex repository
20:43 <joeyh> totally out of time as far as generating such files for now though
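[Editor's note: a sketch of the script joeyh describes, not his implementation. It leans on git-annex's fromkey and registerurl plumbing (registerurl may postdate this log); the input format is the one he names, the sha1-based key layout and everything else is assumed, and whitespace in filenames would break the naive read:]

    #!/bin/bash
    # input lines: <bytes> <checksum> <collection> <item> <file> <url>
    set -e
    git init census-shard && cd census-shard && git annex init shard
    while read -r bytes checksum collection item file url <&3; do
        key="SHA1-s${bytes}--${checksum}"   # git-annex key for a sha1 backend
        mkdir -p "$collection/$item"
        # create the annexed file without having its content locally:
        git annex fromkey --force "$key" "$collection/$item/$file"
        # teach the web special remote where the content lives:
        git annex registerurl "$key" "$url"
    done 3< census.txt
    git commit -m 'import census shard'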
20:43 <SketchCow> I just keep repeating myself, just because I don't like to see things like this blow up because people go "it's not perfect"
20:43 <joeyh> 7drl awaits
20:43 <SketchCow> 7drl?
20:43 <joeyh> writing a roguelike in 7 days
20:43 <SketchCow> From Hank:
20:43 <SketchCow> all files do already have sha1's, which are less collision-prone than md5's. would that be adequate?
20:44 <joeyh> it would be less inadequate
20:44 <joeyh> much less
20:44 <joeyh> :)
20:44 <joeyh> so yes plz, sha1s
20:44 <SketchCow> Would you consider the issue closed, and "make it 256 some time in the future"
20:44 <SketchCow> Well, they're there.
20:44 <SketchCow> They've been there, they're there.
20:45 <joeyh> I think so. practical sha1 attacks have not yet been demonstrated
20:45 <joeyh> also, if users can break sha1, they can break **git**
20:45 <joeyh> if they break git, and we're using git-annex, we have bigger problems
20:45 <SketchCow> Well, then, switch to the sha1
20:45 <joeyh> also, we should switch to archiving github, if sha1 is broken, before it melts down into a puddle
20:46 <SketchCow> So you're out for the count for a week?
20:46 <joeyh> yep, 8 to 16 hour days writing a game
20:46 <joeyh> and then I'm in boston for a week, but a little more available
20:47 <SketchCow> https://www.schneier.com/blog/archives/2005/02/sha1_broken.html
20:47 <SketchCow> :)
20:47 <SketchCow> Anyway, tracey says use md5sum AND sha-1
20:47 <joeyh> yes, but ... no.
20:48 <joeyh> that paper, afaik, has never been published, or the results have not been replicated
20:48 <joeyh> or, it wasn't good enough for practical collisions yet, just a reduction from "impossibly hard" to "convert the sun to rackmount computers" hard
20:49 <joeyh> A 2011 attack by Marc Stevens can produce hash collisions with a complexity between 2^60.3 and 2^65.3 operations.[1] No actual collisions have yet been produced. -- wiki
20:49 <joeyh> that's meant to be 2^60
20:51 <joeyh> so, it was 2^69 in 2005, and 2^60 in 2011.. we can see where this is going
20:51 <SketchCow> Well, understood, joeyh - of course we're going to keep doing the project but your bit will have to wait until you're back
20:51 <joeyh> "estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers"
20:51 <joeyh> man, I so want someone to do that
20:52 <yipdw> $5 million if you use AWS
20:53 <joeyh> having 2 files that sha1 the same would be very useful ;)
20:53 * joeyh suggests they find something that sha1s to 4b825dc642cb6eb9a060e54bf8d69288fbee4904
20:54 <joeyh> that's the git empty tree hash
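[Editor's note: that hash is easy to verify; it is what git assigns to a tree with no entries:]

    git hash-object -t tree /dev/null
    # 4b825dc642cb6eb9a060e54bf8d69288fbee4904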
20:54 <joeyh> so, break git: $2.77M
20:54 <joeyh> backup IA: $???
20:54 <fenn> $400k in 1.5TB tapes
20:57 <SketchCow> https://twitter.com/danieldrucker/status/573948564608577537
20:57 <yipdw> good to know they're pulling out the money card
20:58 <joeyh> I see it right there next to their asshole card
20:58 <SketchCow> http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK&curid=6055&diff=22174&oldid=22172
20:58 <SketchCow> I actually know this technique.
20:58 <SketchCow> It's a technique used by alpha PUAs
20:59 <SketchCow> Works in sales
20:59 <SketchCow> Why would you walk away from _______
20:59 <yipdw> !ao https://twitter.com/danieldrucker/status/573880074732191744
21:00 <yipdw> oops
21:00 <SketchCow> I assume he's going to propose:
21:00 <SketchCow> - Working with someone who has a tape drive somewhere, and put IA on those tapes.
21:00 <SketchCow> And not propose:
21:02 <SketchCow> - Sending a free tape drive to the archive, and tapes
21:02 <SketchCow> Got the mail.
21:02 <SketchCow> It's the first.
21:02 <SketchCow> Sorry, not alpha PUA.
21:02 <SketchCow> My apologies.
21:03 <SketchCow> Academic.
21:03 <SketchCow> Looks the same if you squint.
21:03 <SketchCow> Somewhere in your network of contacts there has to be either someone at Oracle, or someone at a large computing center, who could donate the T10000D drives for your temporary use.
21:04 <SketchCow> That's his "helping you to get access to several hundred thousand dollars of resources"
21:06 <SketchCow> Telling me I should ask around for several hundred thousand dollars of resources.
21:07 <garyrh> I have a spare $200k in the back of my Tesla.
21:10 <SketchCow> Oh, wait, he's making calls.
21:13 <yipdw> garyrh: true cool cats keep their $200k in the frunk
21:26 <garyrh> http://i.imgflip.com/dk9r.gif
21:26 <SketchCow> OK, I've put him over to the admins
21:26 <SketchCow> the IDEA is fine.
21:39 <SketchCow> The APPROACH is also fine
21:40 <SketchCow> Oh thank god DFJustin fixed the animation
21:40 <SketchCow> 08:47, 6 March 2015 CRITICAL NEED TO USE TAPE, FULL EXPLANATION
21:41 <SketchCow> 16:18, 6 March 2015 burning animation fixed
21:43 <DFJustin> :D
21:43 <SketchCow> I'm just a little sensitive
21:43 <SketchCow> To "why dick around with [solution a] when [solution b] is staring you in the face"
21:44 <SketchCow> Also "I asked for people to give shit for free" == "I am getting you shit for free"
21:44 <SketchCow> Anyway, so, tasks for Jason or someone else
21:44 <SketchCow> - Take what's been discussed in here, get on wiki
21:45 <SketchCow> - split up wiki pages into more wiki pages, this stuff's getting large
21:45 <SketchCow> - continue work on census
21:48 <SketchCow> - get real numbers from census
21:59 <SketchCow> http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK now has appropriate gif
22:11 <DFJustin> you might wanna have an asterisk that it's actually in two places, for the literal crowd
22:11 <SketchCow> We're down to 13,075,201 items.
22:11 <SketchCow> (In the census)
23:15 -- Start (~Start@[redacted]) has joined #internetarchive.bak
23:16 -- svchfoo2 gives channel operator status to Start
23:34 -- zottelbey has quit (Remote host closed the connection)
23:51 -- edward_ has quit (Ping timeout: 512 seconds)