04:11 --- BEGIN LOGGING AT Sun Mar 1 23:11:32 2015
04:11 --- Now talking on #internetarchive.bak
04:12 -!- acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
04:18 -!- acridAxid has quit (Quit: Quitting)
04:21 -!- mhazinsk (~matt@[redacted]) has joined #internetarchive.bak
04:25 -!- pikhq (~pikhq@[redacted]) has joined #internetarchive.bak
04:25 <pikhq> You, sir, are insane and I love you for it.
04:27 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
04:29 -!- SketchCow gives channel operator status to Start trs80
04:29 -!- SketchCow gives channel operator status to chfoo garyrh_ mhazinsk pikhq
04:33 -!- You've invited svchfoo1 to #internetarchive.bak (irc.mzima.net)
04:33 -!- svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
04:33 -!- chfoo gives channel operator status to svchfoo1
04:33 -!- You've invited svchfoo2 to #internetarchive.bak (irc.mzima.net)
04:33 -!- svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
04:33 -!- chfoo gives channel operator status to svchfoo2
04:36 -!- godane (~slacker@[redacted]) has joined #internetarchive.bak
04:37 <godane> so i have been keeping most of the internet archives web archives that i upload
04:37 <godane> so i'm already doing your plan of sorts
04:38 -!- garyrh_ gives channel operator status to godane
04:40 <godane> i was sort of think of some sort of linux distro that hosts files are http://internet.archive
04:40 <godane> that domain is a way to not take a domain name
04:42 <mhazinsk> so I think https://tahoe-lafs.org/trac/tahoe-lafs would be worth looking into for this
04:43 -!- chfoo has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
04:43 <mhazinsk> I believe they coined the term "redundant array of independent clouds"
04:46 <SketchCow> Add all proposed solutions to the discussion tab
04:47 <mhazinsk> will do
04:49 -!- acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
05:01 <SketchCow> Good
05:08 <godane> i think being able to download a full collection of something would be nice
05:09 <godane> also folders should be something like main collection -> sub-collection -> sub-sub-collection -> item
05:15 <SketchCow> This isn't that
05:15 <SketchCow> I will be working on writing documentation on how to download everything you want from archive, but that's different.
05:15 <SketchCow> This is you plug in your drive and gets stuff.
05:28 <godane> ok
05:29 <godane> it maybe nice to add later on then
06:39 -!- db48x (~user@[redacted]) has joined #internetarchive.bak
06:43 -!- arkiver (~arkiver@[redacted]) has joined #internetarchive.bak
06:52 -!- Kazzy (~Kaz@[redacted]) has joined #internetarchive.bak
06:58 -!- xmc (~chronomex@[redacted]) has joined #internetarchive.bak
07:08 -!- DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
07:10 <DFJustin> I was just wondering what to do with the drives I'm starting to accumulate from upgrading to larger sizes
07:14 -!- garyrh_ gives channel operator status to Kazzy xmc
07:14 -!- garyrh_ gives channel operator status to acridAxid arkiver db48x DFJustin
07:15 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
07:17 -!- garyrh_ gives channel operator status to yipdw
07:22 -!- Ctrl-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
07:36 <SketchCow> Definitely want to run a census against a collection.
07:36 <SketchCow> (Size vs. no derives)
07:43 -!- arkiver gives channel operator status to Ctrl-S
07:44 <yipdw> guess I'll start re-reading about git-annex, it's been a while
07:44 <yipdw> I do recommend that tool if only because we have developer access, which is huge
07:44 <yipdw> also it seems like it'd work
07:48 <SketchCow> yipdw: I'd like you to also start visualizing what a central infoboard for it might be
07:48 <SketchCow> Some way to visualize the petabytes, bring them into form so one can look over at them and see red yellow green
07:48 <SketchCow> like disk sectors
07:48 <SketchCow> I think that will encourage people
07:48 <yipdw> I can play with some ideas, though my implementation time is pretty limited
07:48 <yipdw> I have a May deadline for a project
07:49 <SketchCow> One nice bit of this is people can take a dock and shove in all their old hard drives
07:49 <SketchCow> And just make them all work
07:50 <SketchCow> I could make someone else take it on. No need for you to have to work on it when you have something coming up
07:50 <SketchCow> It's just a fun "take this data and make it zoomable/nice"
07:51 <SketchCow> git-annex as backend will likely save us a lot of time
07:51 <yipdw> is there a hierarchy to IA items beyond collection -> [item]?
07:52 <yipdw> something like http://mbostock.github.io/d3/talk/20111018/treemap.html might work
07:52 <yipdw> top-level is collections, zoom in to see items
07:52 <yipdw> items with zero backups are red, one yellow, two+ green
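(yipdw's three-state coloring is small enough to write down; a sketch, where the function name and shape are illustrative rather than any project's actual tooling:)

```python
def status_color(backup_copies):
    """Map a backup count to yipdw's proposed treemap color:
    zero backups red, one yellow, two or more green."""
    if backup_copies == 0:
        return "red"
    if backup_copies == 1:
        return "yellow"
    return "green"
```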
07:53 <yipdw> I know it's possible for a browser to have all IA collections in a <select>, since that (used to) happen when you did advanced search
07:53 <SketchCow> Well, in my visualization/vision, we don't quite do it like that.
07:53 <yipdw> it should not be impossible to shove them all into a treemap
07:53 <SketchCow> But maybe we should.
07:53 <SketchCow> These are all bone simple classic CS problems, which is nice
07:53 <yipdw> it'd also allow you to visualize the size of every collection
07:53 <SketchCow> Just happens to be the body in charge is comfortable with our fuckery
07:54 <yipdw> not sure if that's necessary but it can be nice
07:54 <SketchCow> I think we're all agreeing a census needs to be taken.
07:54 <SketchCow> The IA mining program is good for this.
07:54 <SketchCow> https://pypi.python.org/pypi/internetarchive#data-mining
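(The census SketchCow describes can be sketched with the `internetarchive` package linked above. `search_items` and `get_item` are real calls from that package, but the overall structure, function names, and the idea of walking one collection at a time are assumptions, not the project's actual tooling:)

```python
def tally_sizes(file_dicts):
    """Sum the 'size' field over a list of IA file-metadata dicts.

    Size may be missing on some entries; treat absent as 0.
    """
    return sum(int(f.get("size", 0)) for f in file_dicts)


def census(collection):
    """Count items and total bytes in one collection (needs network access).

    The import is deferred so tally_sizes stays usable without the
    `internetarchive` package installed.
    """
    from internetarchive import search_items, get_item
    items, total = 0, 0
    for result in search_items("collection:" + collection):
        item = get_item(result["identifier"])
        total += tally_sizes(item.files)  # item.files: list of file dicts
        items += 1
    return items, total
```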
07:54 -!- Rotab (~Rotab@[redacted]) has joined #internetarchive.bak
07:55 <yipdw> ah yeah
07:55 <yipdw> ia mine is awesome
08:03 <SketchCow> Thought: Encrypt the data, but make it VERY easy to unencrypt?
08:03 <SketchCow> So you can fuck with the files, get them if you want, but it will never ever be able to be packed back in for bad actor.
08:03 <SketchCow> And by "never ever", I mean "to defraud without detection"
08:06 <yipdw> not sure, I think it might be easier to have a trusted repository of SHA256 hashes or something
08:06 <yipdw> need to read up more on what (say) git-annex does for this, if anything at all
08:07 <yipdw> git-annex has a trust concept but AFAICT it is not meant to protect against hostile actors
08:07 <yipdw> it's more about "do I trust that this repository is or can be brought online"
08:08 <Ctrl-S> can you use a hash of the data for each block, then distribute the hashes widely?
08:08 <Ctrl-S> I think bittorrent uses something similar
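(The per-block hashing Ctrl-S describes — BitTorrent's piece-list idea — is a few lines; a minimal sketch, with the 1 MiB block size an arbitrary choice:)

```python
import hashlib


def block_hashes(data, block_size=2**20):
    """SHA-256 each fixed-size block of `data`, BitTorrent piece-list style.

    Publishing this list widely lets anyone verify a single block
    without holding the rest of the file.
    """
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]
```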
08:09 <yipdw> a DHT is possible but more complicated than "here's a repo of hashes, it's canonical"
08:11 <yipdw> or were you referring to the hashes of each block
08:11 <Ctrl-S> distribute that repo with the blocks of data?
08:11 <yipdw> if there is to be such a repo I'd suggest it just live at IA for starters
08:11 <yipdw> no need to distribute everything, that's too hard
08:11 <SketchCow> It really does sound like bad actors are the only big problem
08:12 <Ctrl-S> then what happens if IA fails?
08:12 <SketchCow> Everything else is just UI
08:12 <yipdw> a repo of hashes is way easier to back up than 20 petabytes of data
08:12 <SketchCow> I think there's definitely a case of classes of users
08:12 <yipdw> hash computation is costly but it's not too bad for items that don't change much
08:13 <SketchCow> So, say, myself and IA and some other locations are trusted and compared with each other
08:13 <SketchCow> And then that family of sources (Not just at IA, of course!) is used to store info on the other 50,000 assholes
08:13 <yipdw> one way to avoid most bad actors is to not let them in on the scheme at all at first
08:13 <SketchCow> Or to be able to ban out
08:13 <Ctrl-S> I mean when you send out a block of data, send the latest version of the hash repo with it
08:13 <SketchCow> right
08:13 <yipdw> I mean, to participate in this you'd need to have some significant capital and ability to demonstrate commitment
08:13 <SketchCow> Disagree
08:14 <SketchCow> On the first, not the second
08:14 <yipdw> fair enough, I was thinking of significant as "a couple thousand USD"
08:14 <SketchCow> But it won't go over the hump if we don't have people just shoving hard drives one by one, into a dock and the drive getting assigned love
08:14 <yipdw> maybe it's not even that though
08:14 <yipdw> sure
08:14 <SketchCow> I think it's $50
08:14 <SketchCow> 500gb drive
08:14 <SketchCow> Or $0
08:14 <SketchCow> pile of drives you weren't using at the hacker space
08:14 <SketchCow> Even if they get used by others
08:15 <yipdw> ah ok
08:15 <SketchCow> I realize balancing bad actor issues vs ease is a problem, but it's a problem that's solvable.
08:16 <yipdw> some sort of integrity checking is needed regardless of bad actors
08:16 <SketchCow> The only thing is not to get so crippled with fear of bad actors that we hold the project back months
08:16 <yipdw> so yeah
08:16 <SketchCow> I'd like it working, with trusties, then figure out further
08:16 <SketchCow> Trusties and some cool data on the site
08:18 <yipdw> so, back in the Early Days
08:18 <yipdw> underscor did something like this: https://github.com/ArchiveTeam/ia-textfiles_audio
08:18 <yipdw> those are git-annex repositories that have archive.org as their only source
08:18 <yipdw> so it's not what we want but it's a step
08:27 <SketchCow> OK, bed
08:28 <SketchCow> Please put everything you can into the wiki, I can see this project getting mired in discussions of bad actors and implementation over and over
08:28 <SketchCow> Especially ones being addressed
08:28 <SketchCow> I also think a working but breakable by bad actors version is a good first step
08:28 <yipdw> ok
08:28 <SketchCow> We can use circles of trust initially
08:29 <SketchCow> Obviously over time, it has to be more resilient
10:37 -!- fenn (~fenn@[redacted]) has joined #internetarchive.bak
11:08 -!- lhobas (sid41114@[redacted]) has joined #internetarchive.bak
12:43 -!- db48x has quit (Ping timeout: 258 seconds)
13:56 -!- lhobas has quit (hub.se efnet.port80.se)
13:59 -!- achip (~thechip@[redacted]) has joined #internetarchive.bak
14:01 -!- lhobas (sid41114@[redacted]) has joined #internetarchive.bak
14:23 -!- closure (~lambda@[redacted]) has joined #internetarchive.bak
14:29 -!- thechip (~chipw@[redacted]) has joined #internetarchive.bak
14:31 <closure> SketchCow, guys: so, a git-annex POV on this: 1. It would need to be under a million files. git gets janky with too many files in a repository. tar files are fine of course
14:33 <closure> 2. as the model is essentially a shared git repo that anyone in the world can write to, there will be bad actors. Stupid pushes would need to be filtered out.
14:35 <closure> 3. you want periodic verification that nodes still have their content. In git-annex terms, a fsck. Currently git-annex does not record fsck results in the git repo, and I think it would need to for this application (it's doable)
14:35 -!- tephra_ (~tephra@[redacted]) has joined #internetarchive.bak
14:35 <closure> 4. awesome!
14:36 <Kazzy> This sort of thing makes me think about looking into storj: http://storj.io/
14:36 <Kazzy> it's nowhere near finished, but it looks like the kind of 'system' we're looking for here.. verification, multiple copies
14:36 <closure> tahoe is also certainly worth investigating more. I lurk on their dev channel, but I can't say I understand it well enough to know how it would work in this situation
14:37 <Kazzy> if it can be adapted to have one central host, which tells clients exactly what they need to have, it could be possible
14:37 <Kazzy> will throw links at wiki discussion page too
14:41 <closure> storj looks interesting, but the first thing I see in their blog is "We’ve successfully scaled this up to 100 GiB already, and we are optimizing and tweaking to scale up another order of magnitude in the near future."
14:43 <Kazzy> yep, it's absolutely nowhere near production ready at this point, but has potential to become a viable solution for this long-term
14:43 <Kazzy> Can't add this to wiki talk page, some spamlist error is refusing to let me post
14:43 <closure> although their blog is talking about proving you still have the content every 5 minutes
14:59 <tephra_> did some quick gscholar searches and found some interesting links: https://gnunet.org/sites/default/files/10.1.1.94.4826.pdf and http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF
15:02 -!- yipdw has quit (Read error: Operation timed out)
15:09 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
15:09 -!- svchfoo2 gives channel operator status to yipdw
15:13 -!- yipdw has quit (Read error: Operation timed out)
15:17 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
15:18 -!- Start has quit (Disconnected.)
15:18 -!- svchfoo1 gives channel operator status to yipdw
15:18 -!- svchfoo2 gives channel operator status to yipdw
15:18 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
15:19 -!- svchfoo1 gives channel operator status to Start
15:22 -!- Start has quit (Client Quit)
15:37 <SketchCow> closure: Thanks for the input.
15:49 <SketchCow> (Added to the Wiki)
15:51 <SketchCow> Also added storj.
15:52 <SketchCow> So, two thoughts taking this into consideration:
15:52 <SketchCow> - Sounds like bad actors can't easily be ruled out algorithmically.
15:53 <SketchCow> - The way to go, therefore, is removing dilettantes and instead working to make sure all contributing of disk space is done by people comfortable with higher levels of verification.
15:53 <SketchCow> (So a smaller pile of people stepping forward as volunteer corps instead of everyone just drops hard drives)
16:02 <SketchCow> I am doing some in the field archiving today (going to a house to get 800 pieces of boxed software, then going to pick up 100 boxes of FOIA FBI files on communism and right wing groups)
16:02 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
16:02 <SketchCow> But I will be thinking of this often. If people want to keep adding notes to the endeavor, that would be great.
16:05 <SketchCow> --
16:06 <SketchCow> Put another way, is the risk greater that someone, volunteering and signing up, and then getting copies they mess with themselves in some dastardly fashion, greater than someone making a homemade bomb and wandering into our datacenter because they don't like the files?
16:07 <SketchCow> The more I consider it, the more I think that since it doesn't flow BACK into the archive unless we tap you, and then we're running the checker against you anyway, the bad actor situation becomes heavily mitigated.
16:08 <SketchCow> In theory, someone can imitate a lot of people and grab a lot of drives but they don't grab the drives.
16:08 <SketchCow> That's a lot. A LOT, of work
16:09 <SketchCow> I say we classify people as registered and anonymous
16:09 <SketchCow> anonymous sectors are less dependable and don't count directly to the green
16:13 -!- swebb (~swebb@[redacted]) has joined #internetarchive.bak
16:14 -!- bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
16:32 <SketchCow> http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK updated, including requested project at the bottom
16:32 <mhazinsk> maybe tiered storage would be useful? e.g. have one copy of IA on 'trusted' users' machines, one tier in the 'cloud' (unreliable but probably not malicious), and extra copies on unregistered users (last resort and possibly malicious)
16:51 -!- Start has quit (Disconnected.)
16:52 <SketchCow> That's what I mean
16:53 <SketchCow> But the cloud is basically either. I don't care if it's hard drives in a datacenter or a user's laptop
16:54 <swebb> How much storage is available on freenet? http://en.wikipedia.org/wiki/Freenet
16:54 -!- Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
16:55 <swebb> That's sort of a distributed 'dark net' storage system where you provide storage on your machine for others to store stuff on, in trade, you get encrypted storage on their machine.
16:57 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
17:31 -!- everdred (~irssi@[redacted]) has joined #internetarchive.bak
17:31 -!- db48x (~user@[redacted]) has joined #internetarchive.bak
17:31 -!- svchfoo1 gives channel operator status to db48x
17:32 <db48x> hmm
17:40 <DFJustin> imo keeping the already existing file checksums in several trusted places and then verifying against that in the rare case of flowing back into the archive is sufficient to address bad actor concerns
17:42 <yipdw> for maximum geek cred you could use the hashes in the canonical URIs
17:42 <yipdw> I guess that's sort of the freenet approach
17:43 <DFJustin> the barrier of entry needs to stay low for people getting in on this, for example there's only around 100 warriors running at any given time and this strikes me as a bigger commitment
17:43 <yipdw> yeah
17:43 <DFJustin> we'll need a couple orders of magnitude more than that
17:45 -!- Start has quit (Disconnected.)
17:46 <DFJustin> every item on ia has a _files.xml file with md5, crc32, and sha1 for every file on the item https://archive.org/download/pdfy-maIfVwkWLxVuMfPP/pdfy-maIfVwkWLxVuMfPP_files.xml
17:47 <DFJustin> granted those aren't cryptographically the best but the combination is probably decently secure
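(Verification against an item's `_files.xml`, as DFJustin describes, could look roughly like this. The `<file name=…><md5>…</md5><sha1>…</sha1></file>` layout matches the example file linked above, but treat the function as a sketch, not any project's actual verifier:)

```python
import hashlib
import xml.etree.ElementTree as ET


def verify_item(files_xml, read_bytes):
    """Check local copies against the md5/sha1 digests in a *_files.xml.

    files_xml  -- XML text of the item's _files.xml
    read_bytes -- callable returning the local bytes for a file name
    Returns {file name: bool}; a file passes only if every digest
    recorded for it matches.
    """
    results = {}
    for f in ET.fromstring(files_xml).findall("file"):
        data = read_bytes(f.get("name"))
        ok = True
        for algo in ("md5", "sha1"):
            node = f.find(algo)
            if node is not None and node.text:
                ok = ok and hashlib.new(algo, data).hexdigest() == node.text
        results[f.get("name")] = ok
    return results
```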
17:50 <yipdw> sure
17:50 <DFJustin> oh I guess you could have ia sign the files with a secret key and then verify that later
17:50 <yipdw> if this takes off too it doesn't seem like it'd be too bad to also have IA start generating sha256
17:50 <yipdw> or sha3 whatever
17:56 -!- chazchaz (~chazchaz@[redacted]) has joined #internetarchive.bak
17:59 <Kenshin> there are a lot of people with old drives though. it does sound possible
18:00 <Kenshin> it's like the discussion we had over twitpic storage space. heh
18:05 <tephra_> yipdw:
18:05 <tephra_> yipdw: i think sha256 or even the combination would be fine enough
18:08 <garyrh_> Finding a collision for 3 or 4 different hash/checksums would be quite a feat.
18:12 <yipdw> tephra_: yeah, probably. I suggested SHA3 because SHA-3 can be faster than SHA-2
18:14 <yipdw> (even faster of course is not calculating anything at all)
18:16 <tephra_> yipdw: oh really, haven't really read up on sha3 I actually thought it was slower
18:17 <yipdw> tephra_: I guess it's kind of a wash, but you can save 2-3 cycles/byte sometimes -> http://bench.cr.yp.to/results-sha3.html
18:17 <yipdw> keccakc512 vs. sha256/512
18:18 <yipdw> anyway, hashing aside
18:18 <yipdw> heh
18:18 <tephra_> heh
18:28 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
18:28 -!- Void_ (~Void@[redacted]) has joined #internetarchive.bak
18:50 -!- Start has quit (Ping timeout: 370 seconds)
19:25 <tephra_> quick and very dirty script that prints the total size of the original files of a collection: https://gist.github.com/EricIO/56ea545df41c303e13cb
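(tephra_'s gist itself is linked above; the core of such a script is presumably something like the following — a reconstruction from the surrounding discussion, not the gist's actual code — summing sizes only for files whose `source` is `original`:)

```python
def original_bytes(file_dicts):
    """Total size of files marked source='original' in an item's metadata."""
    return sum(int(f.get("size", 0))
               for f in file_dicts
               if f.get("source") == "original")


def collection_original_bytes(collection):
    """Network version: walk a collection via the `internetarchive` package
    (deferred import so the pure helper above works without it installed)."""
    from internetarchive import search_items, get_item
    return sum(original_bytes(get_item(r["identifier"]).files)
               for r in search_items("collection:" + collection))
```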
19:40 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
19:41 -!- Start has quit (Read error: Connection reset by peer)
19:41 -!- Start_ (~Start@[redacted]) has joined #internetarchive.bak
20:11 -!- Start_ has quit (Disconnected.)
20:24 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
20:26 -!- Start has quit (Client Quit)
20:28 -!- SadDM (~SadDM@[redacted]) has joined #internetarchive.bak
20:32 -!- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
20:39 -!- bzc6p has quit (Read error: Operation timed out)
21:19 <SketchCow> Tephra_ can you put the totals in the wiki?
21:19 <SketchCow> or can someone run them?
21:32 <SketchCow> here is the question
21:34 <SketchCow> cryptographic without depending on the archive
21:34 <SketchCow> good or bad
21:36 <yipdw> I'm not sure what that means
21:36 <yipdw> generate hashes without depending on IA?
21:45 <SketchCow> sorry
21:46 <SketchCow> i am in a trick
21:46 <SketchCow> truck
21:47 <SketchCow> so. ideal world, you have the crypto on the drive.
21:47 <SketchCow> maybe a .sh on the drive that when run, unpacks it?
21:49 <garyrh_> you mean like the files are signed with a public key?
21:50 <garyrh_> signing just the metadata might work
21:51 <SketchCow> I am not great at defining solutions.
21:51 <SketchCow> having chunks out there is fine
21:52 <SketchCow> and if we have to restore, encrypted chunks are fine.
21:52 <SketchCow> but I want someone local to encrypt chunks.
21:52 <SketchCow> no IA no central board. the nuclear recovery option
21:53 <tephra_> SketchCow: for the collections already on the wiki you mean? re totals on the wiki
21:54 <SketchCow> tephra. yes please
21:54 <tephra_> SketchCow: right on it
21:54 <SketchCow> thanks
21:54 <SketchCow> full and prig
21:54 <SketchCow> orig
21:54 <tephra_> sure
21:55 <SketchCow> we might have xml and stuff missed but it will still be useful
21:57 <tephra_> so now the script only counts files that have the 'source' label as 'original' which for example for the item https://archive.org/details/Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt
21:57 <tephra_> are Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt.pdf_meta.txt
21:57 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt.pdf
21:57 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt_meta.xml
21:58 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt_files.xml
21:58 <tephra_> all those are labeled as 'original' in the metadata from the internetarchive python wrapper
21:59 <SketchCow> good
21:59 <SketchCow> agrees
21:59 <SketchCow> agreed
21:59 <tephra_> good
22:00 <SketchCow> go for it.
22:00 <SketchCow> as a bonus at the end, do "movies" ;)
22:02 <tephra_> hehe sure
22:05 <tephra_> do you have a smallish collection with the known total data just to sanity check?
22:07 <SketchCow> choose a magazine
22:07 <tephra_> informaticacpu is ok only four items
22:12 <SketchCow> great
22:19 <tephra_> getting total: 266074360 and original 125517341
22:21 <SketchCow> I'd say spreadsheet it, verify, then do the biggies
22:21 <tephra_> is it possible to see all files for an item on archive.org can't seem to find them
22:48 <tephra_> oh i see it, stupid of me
23:07 <trs80> in terms of bad actors, only allowing users to have one copy of a file will help
23:14 <tephra_> hmm the IA api wrapper doesn't give a size for the _files.xml file in the metadata
23:33 -!- ivan` (~ivan@[redacted]) has joined #internetarchive.bak
23:33 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
23:33 -!- svchfoo2 gives channel operator status to Start