04:11 --- BEGIN LOGGING AT Sun Mar 1 23:11:32 2015
04:11 --- Now talking on #internetarchive.bak
04:12 -!- acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
04:18 -!- acridAxid has quit (Quit: Quitting)
04:21 -!- mhazinsk (~matt@[redacted]) has joined #internetarchive.bak
04:25 -!- pikhq (~pikhq@[redacted]) has joined #internetarchive.bak
04:25 <pikhq> You, sir, are insane and I love you for it.
04:27 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
04:29 -!- SketchCow gives channel operator status to Start trs80
04:29 -!- SketchCow gives channel operator status to chfoo garyrh_ mhazinsk pikhq
04:33 -!- You've invited svchfoo1 to #internetarchive.bak (irc.mzima.net)
04:33 -!- svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
04:33 -!- chfoo gives channel operator status to svchfoo1
04:33 -!- You've invited svchfoo2 to #internetarchive.bak (irc.mzima.net)
04:33 -!- svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
04:33 -!- chfoo gives channel operator status to svchfoo2
04:36 -!- godane (~slacker@[redacted]) has joined #internetarchive.bak
04:37 <godane> so i have been keeping most of the internet archives web archives that i upload
04:37 <godane> so i'm already doing your plan of sorts
04:38 -!- garyrh_ gives channel operator status to godane
04:40 <godane> i was sort of think of some sort of linux distro that hosts files are http://internet.archive
04:40 <godane> that domain is a way to not take a domain name
04:42 <mhazinsk> so I think https://tahoe-lafs.org/trac/tahoe-lafs would be worth looking into for this
04:43 -!- chfoo has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
04:43 <mhazinsk> I believe they coined the term "redundant array of independent clouds"
04:46 <SketchCow> Add all proposed solutions to the discussion tab
04:47 <mhazinsk> will do
04:49 -!- acridAxid (~acridAxid@[redacted]) has joined #internetarchive.bak
05:01 <SketchCow> Good
05:08 <godane> i think being able to download a full collection of something would be nice
05:09 <godane> also folders should be something like main collection -> sub-collection -> sub-sub-collection -> item
05:15 <SketchCow> This isn't that
05:15 <SketchCow> I will be working on writing documentation on how to download everything you want from archive, but that's different.
05:15 <SketchCow> This is you plug in your drive and gets stuff.
05:28 <godane> ok
05:29 <godane> it maybe nice to add later on then
06:39 -!- db48x (~user@[redacted]) has joined #internetarchive.bak
06:43 -!- arkiver (~arkiver@[redacted]) has joined #internetarchive.bak
06:52 -!- Kazzy (~Kaz@[redacted]) has joined #internetarchive.bak
06:58 -!- xmc (~chronomex@[redacted]) has joined #internetarchive.bak
07:08 -!- DFJustin (DopefishJu@[redacted]) has joined #internetarchive.bak
07:10 <DFJustin> I was just wondering what to do with the drives I'm starting to accumulate from upgrading to larger sizes
07:14 -!- garyrh_ gives channel operator status to Kazzy xmc
07:14 -!- garyrh_ gives channel operator status to acridAxid arkiver db48x DFJustin
07:15 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
07:17 -!- garyrh_ gives channel operator status to yipdw
07:22 -!- Ctrl-S (~Ctrl-S@[redacted]) has joined #internetarchive.bak
07:36 <SketchCow> Definitely want to run a census against a collection.
07:36 <SketchCow> (Size vs. no derives)
07:43 -!- arkiver gives channel operator status to Ctrl-S
07:44 <yipdw> guess I'll start re-reading about git-annex, it's been a while
07:44 <yipdw> I do recommend that tool if only because we have developer access, which is huge
07:44 <yipdw> also it seems like it'd work
07:48 <SketchCow> yipdw: I'd like you to also start visualizing what a central infoboard for it might be
07:48 <SketchCow> Some way to visualize the petabytes, bring them into form so one can look over at them and see red yellow green
07:48 <SketchCow> like disk sectors
07:48 <SketchCow> I think that will encourage people
07:48 <yipdw> I can play with some ideas, though my implementation time is pretty limited
07:48 <yipdw> I have a May deadline for a project
07:49 <SketchCow> One nice bit of this is people can take a dock and shove in all their old hard drives
07:49 <SketchCow> And just make them all work
07:50 <SketchCow> I could make someone else take it on. No need for you to have to work on it when you have something coming up
07:50 <SketchCow> It's just a fun "take this data and make it zoomable/nice"
07:51 <SketchCow> git-annex as backend will likely save us a lot of time
07:51 <yipdw> is there a hierarchy to IA items beyond collection -> [item]?
07:52 <yipdw> something like http://mbostock.github.io/d3/talk/20111018/treemap.html might work
07:52 <yipdw> top-level is collections, zoom in to see items
07:52 <yipdw> items with zero backups are red, one yellow, two+ green
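(yipdw's three-state coloring is small enough to write down; a sketch, where the function name and shape are illustrative rather than any project's actual tooling:)

```python
def status_color(backup_copies):
    """Map a backup count to yipdw's proposed treemap color:
    zero backups red, one yellow, two or more green."""
    if backup_copies == 0:
        return "red"
    if backup_copies == 1:
        return "yellow"
    return "green"
```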
07:53 <yipdw> I know it's possible for a browser to have all IA collections in a <select>, since that (used to) happen when you did advanced search
07:53 <SketchCow> Well, in my visualization/vision, we don't quite do it like that.
07:53 <yipdw> it should not be impossible to shove them all into a treemap
07:53 <SketchCow> But maybe we should.
07:53 <SketchCow> These are all bone simple classic CS problems, which is nice
07:53 <yipdw> it'd also allow you to visualize the size of every collection
07:53 <SketchCow> Just happens to be the body in charge is comfortable with our fuckery
07:54 <yipdw> not sure if that's necessary but it can be nice
07:54 <SketchCow> I think we're all agreeing a census needs to be taken.
07:54 <SketchCow> The IA mining program is good for this.
07:54 <SketchCow> https://pypi.python.org/pypi/internetarchive#data-mining
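(The census SketchCow describes can be sketched with the `internetarchive` package linked above. `search_items` and `get_item` are real calls from that package, but the overall structure, function names, and the idea of walking one collection at a time are assumptions, not the project's actual tooling:)

```python
def tally_sizes(file_dicts):
    """Sum the 'size' field over a list of IA file-metadata dicts.

    Size may be missing on some entries; treat absent as 0.
    """
    return sum(int(f.get("size", 0)) for f in file_dicts)


def census(collection):
    """Count items and total bytes in one collection (needs network access).

    The import is deferred so tally_sizes stays usable without the
    `internetarchive` package installed.
    """
    from internetarchive import search_items, get_item
    items, total = 0, 0
    for result in search_items("collection:" + collection):
        item = get_item(result["identifier"])
        total += tally_sizes(item.files)  # item.files: list of file dicts
        items += 1
    return items, total
```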
07:54 -!- Rotab (~Rotab@[redacted]) has joined #internetarchive.bak
07:55 <yipdw> ah yeah
07:55 <yipdw> ia mine is awesome
08:03 <SketchCow> Thought: Encrypt the data, but make it VERY easy to unencrypt?
08:03 <SketchCow> So you can fuck with the files, get them if you want, but it will never ever be able to be packed back in for bad actor.
08:03 <SketchCow> And by "never ever", I mean "to defraud without detection"
08:06 <yipdw> not sure, I think it might be easier to have a trusted repository of SHA256 hashes or something
08:06 <yipdw> need to read up more on what (say) git-annex does for this, if anything at all
08:07 <yipdw> git-annex has a trust concept but AFAICT it is not meant to protect against hostile actors
08:07 <yipdw> it's more about "do I trust that this repository is or can be brought online"
08:08 <Ctrl-S> can you use a hash of the data for each block, then distribute the hashes widely?
08:08 <Ctrl-S> I think bittorrent uses something similar
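(The per-block hashing Ctrl-S describes — BitTorrent's piece-list idea — is a few lines; a minimal sketch, with the 1 MiB block size an arbitrary choice:)

```python
import hashlib


def block_hashes(data, block_size=2**20):
    """SHA-256 each fixed-size block of `data`, BitTorrent piece-list style.

    Publishing this list widely lets anyone verify a single block
    without holding the rest of the file.
    """
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]
```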
08:09 <yipdw> a DHT is possible but more complicated than "here's a repo of hashes, it's canonical"
08:11 <yipdw> or were you referring to the hashes of each block
08:11 <Ctrl-S> distribute that repo with the blocks of data?
08:11 <yipdw> if there is to be such a repo I'd suggest it just live at IA for starters
08:11 <yipdw> no need to distribute everything, that's too hard
08:11 <SketchCow> It really does sound like bad actors are the only big problem
08:12 <Ctrl-S> then what happens if IA fails?
08:12 <SketchCow> Everything else is just UI
08:12 <yipdw> a repo of hashes is way easier to back up than 20 petabytes of data
08:12 <SketchCow> I think there's definitely a case of classes of users
08:12 <yipdw> hash computation is costly but it's not too bad for items that don't change much
08:13 <SketchCow> So, say, myself and IA and some other locations are trusted and compared with each other
08:13 <SketchCow> And then that family of sources (Not just at IA, of course!) is used to store info on the other 50,000 assholes
08:13 <yipdw> one way to avoid most bad actors is to not let them in on the scheme at all at first
08:13 <SketchCow> Or to be able to ban out
08:13 <Ctrl-S> I mean when you send out a block of data, send the latest version of the hash repo with it
08:13 <SketchCow> right
08:13 <yipdw> I mean, to participate in this you'd need to have some significant capital and ability to demonstrate commitment
08:13 <SketchCow> Disagree
08:14 <SketchCow> On the first, not the second
08:14 <yipdw> fair enough, I was thinking of significant as "a couple thousand USD"
08:14 <SketchCow> But it won't go over the hump if we don't have people just shoving hard drives one by one, into a dock and the drive getting assigned love
08:14 <yipdw> maybe it's not even that though
08:14 <yipdw> sure
08:14 <SketchCow> I think it's $50
08:14 <SketchCow> 500gb drive
08:14 <SketchCow> Or $0
08:14 <SketchCow> pile of drives you weren't using at the hacker space
08:14 <SketchCow> Even if they get used by others
08:15 <yipdw> ah ok
08:15 <SketchCow> I realize balancing bad actor issues vs ease is a problem, but it's a problem that's solvable.
08:16 <yipdw> some sort of integrity checking is needed regardless of bad actors
08:16 <SketchCow> The only thing is not to get so crippled with fear of bad actors that we hold the project back months
08:16 <yipdw> so yeah
08:16 <SketchCow> I'd like it working, with trusties, then figure out further
08:16 <SketchCow> Trusties and some cool data on the site
08:18 <yipdw> so, back in the Early Days
08:18 <yipdw> underscor did something like this: https://github.com/ArchiveTeam/ia-textfiles_audio
08:18 <yipdw> those are git-annex repositories that have archive.org as their only source
08:18 <yipdw> so it's not what we want but it's a step
08:27 <SketchCow> OK, bed
08:28 <SketchCow> Please put everything you can into the wiki, I can see this project getting mired in discussions of bad actors and implementation over and over
08:28 <SketchCow> Especially ones being addressed
08:28 <SketchCow> I also think a working but breakable by bad actors version is a good first step
08:28 <yipdw> ok
08:28 <SketchCow> We can use circles of trust initially
08:29 <SketchCow> Obviously over time, it has to be more resilient
10:37 -!- fenn (~fenn@[redacted]) has joined #internetarchive.bak
11:08 -!- lhobas (sid41114@[redacted]) has joined #internetarchive.bak
12:43 -!- db48x has quit (Ping timeout: 258 seconds)
13:56 -!- lhobas has quit (hub.se efnet.port80.se)
13:59 -!- achip (~thechip@[redacted]) has joined #internetarchive.bak
14:01 -!- lhobas (sid41114@[redacted]) has joined #internetarchive.bak
14:23 -!- closure (~lambda@[redacted]) has joined #internetarchive.bak
14:29 -!- thechip (~chipw@[redacted]) has joined #internetarchive.bak
14:31 <closure> SketchCow, guys: so, a git-annex POV on this: 1. It would need to be under a million files. git gets janky with too many files in a repository. tar files are fine of course
14:33 <closure> 2. as the model is essentially a shared git repo that anyone in the world can write to, there will be bad actors. Stupid pushes would need to be filtered out.
14:35 <closure> 3. you want periodic verification that nodes still have their content. In git-annex terms, a fsck. Currently git-annex does not record fsck results in the git repo, and I think it would need to for this application (it's doable)
14:35 -!- tephra_ (~tephra@[redacted]) has joined #internetarchive.bak
14:35 <closure> 4. awesome!
14:36 <Kazzy> This sort of thing makes me think about looking into storj: http://storj.io/
14:36 <Kazzy> it's nowhere near finished, but it looks like the kind of 'system' we're looking for here.. verification, multiple copies
14:36 <closure> tahoe is also certainly worth investigating more. I lurk on their dev channel, but I can't say I understand it well enough to know how it would work in this situation
14:37 <Kazzy> if it can be adapted to have one central host, which tells clients exactly what they need to have, it could be possible
14:37 <Kazzy> will throw links at wiki discussion page too
14:41 <closure> storj looks interesting, but the first thing I see in their blog is "We’ve successfully scaled this up to 100 GiB already, and we are optimizing and tweaking to scale up another order of magnitude in the near future."
14:43 <Kazzy> yep, it's absolutely nowhere near production ready at this point, but has potential to become a viable solution for this long-term
14:43 <Kazzy> Can't add this to wiki talk page, some spamlist error is refusing to let me post
14:43 <closure> although their blog is talking about proving you still have the content every 5 minutes
14:59 <tephra_> did some quick gscholar searches and found some interesting links: https://gnunet.org/sites/default/files/10.1.1.94.4826.pdf and http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF
15:02 -!- yipdw has quit (Read error: Operation timed out)
15:09 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
15:09 -!- svchfoo2 gives channel operator status to yipdw
15:13 -!- yipdw has quit (Read error: Operation timed out)
15:17 -!- yipdw (~yipdw@[redacted]) has joined #internetarchive.bak
15:18 -!- Start has quit (Disconnected.)
15:18 -!- svchfoo1 gives channel operator status to yipdw
15:18 -!- svchfoo2 gives channel operator status to yipdw
15:18 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
15:19 -!- svchfoo1 gives channel operator status to Start
15:22 -!- Start has quit (Client Quit)
15:37 <SketchCow> closure: Thanks for the input.
15:49 <SketchCow> (Added to the Wiki)
15:51 <SketchCow> Also added storj.
15:52 <SketchCow> So, two thoughts taking this into consideration:
15:52 <SketchCow> - Sounds like bad actors can't easily be ruled out algorithmically.
15:53 <SketchCow> - The way to go, therefore, is removing dilettantes and instead working to make sure all contributing of disk space is done by people comfortable with higher levels of verification.
15:53 <SketchCow> (So a smaller pile of people stepping forward as volunteer corps instead of everyone just drops hard drives)
16:02 <SketchCow> I am doing some in the field archiving today (going to a house to get 800 pieces of boxed software, then going to pick up 100 boxes of FOIA FBI files on communism and right wing groups)
16:02 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
16:02 <SketchCow> But I will be thinking of this often. If people want to keep adding notes to the endeavor, that would be great.
16:05 <SketchCow> --
16:06 <SketchCow> Put another way, is the risk greater that someone, volunteering and signing up, and then getting copies they mess with themselves in some dastardly fashion, greater than someone making a homemade bomb and wandering into our datacenter because they don't like the files?
16:07 <SketchCow> The more I consider it, the more I think that since it doesn't flow BACK into the archive unless we tap you, and then we're running the checker against you anyway, the bad actor situation becomes heavily mitigated.
16:08 <SketchCow> In theory, someone can imitate a lot of people and grab a lot of drives but they don't grab the drives.
16:08 <SketchCow> That's a lot. A LOT, of work
16:09 <SketchCow> I say we classify people as registered and anonymous
16:09 <SketchCow> anonymous sectors are less dependable and don't count directly to the green
16:13 -!- swebb (~swebb@[redacted]) has joined #internetarchive.bak
16:14 -!- bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
16:32 <SketchCow> http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK updated, including requested project at the bottom
16:32 <mhazinsk> maybe tiered storage would be useful? e.g. have one copy of IA on 'trusted' users' machines, one tier in the 'cloud' (unreliable but probably not malicious), and extra copies on unregistered users (last resort and possibly malicious)
16:51 -!- Start has quit (Disconnected.)
16:52 <SketchCow> That's what I mean
16:53 <SketchCow> But the cloud is basically either. I don't care if it's hard drives in a datacenter or a user's laptop
16:54 <swebb> How much storage is available on freenet? http://en.wikipedia.org/wiki/Freenet
16:54 -!- Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
16:55 <swebb> That's sort of a distributed 'dark net' storage system where you provide storage on your machine for others to store stuff on, in trade, you get encrypted storage on their machine.
16:57 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
17:31 -!- everdred (~irssi@[redacted]) has joined #internetarchive.bak
17:31 -!- db48x (~user@[redacted]) has joined #internetarchive.bak
17:31 -!- svchfoo1 gives channel operator status to db48x
17:32 <db48x> hmm
17:40 <DFJustin> imo keeping the already existing file checksums in several trusted places and then verifying against that in the rare case of flowing back into the archive is sufficient to address bad actor concerns
17:42 <yipdw> for maximum geek cred you could use the hashes in the canonical URIs
17:42 <yipdw> I guess that's sort of the freenet approach
17:43 <DFJustin> the barrier of entry needs to stay low for people getting in on this, for example there's only around 100 warriors running at any given time and this strikes me as a bigger commitment
17:43 <yipdw> yeah
17:43 <DFJustin> we'll need a couple orders of magnitude more than that
17:45 -!- Start has quit (Disconnected.)
17:46 <DFJustin> every item on ia has a _files.xml file with md5, crc32, and sha1 for every file on the item https://archive.org/download/pdfy-maIfVwkWLxVuMfPP/pdfy-maIfVwkWLxVuMfPP_files.xml
17:47 <DFJustin> granted those aren't cryptographically the best but the combination is probably decently secure
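(Verification against an item's `_files.xml`, as DFJustin describes, could look roughly like this. The `<file name=…><md5>…</md5><sha1>…</sha1></file>` layout matches the example file linked above, but treat the function as a sketch, not any project's actual verifier:)

```python
import hashlib
import xml.etree.ElementTree as ET


def verify_item(files_xml, read_bytes):
    """Check local copies against the md5/sha1 digests in a *_files.xml.

    files_xml  -- XML text of the item's _files.xml
    read_bytes -- callable returning the local bytes for a file name
    Returns {file name: bool}; a file passes only if every digest
    recorded for it matches.
    """
    results = {}
    for f in ET.fromstring(files_xml).findall("file"):
        data = read_bytes(f.get("name"))
        ok = True
        for algo in ("md5", "sha1"):
            node = f.find(algo)
            if node is not None and node.text:
                ok = ok and hashlib.new(algo, data).hexdigest() == node.text
        results[f.get("name")] = ok
    return results
```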
17:50 <yipdw> sure
17:50 <DFJustin> oh I guess you could have ia sign the files with a secret key and then verify that later
17:50 <yipdw> if this takes off too it doesn't seem like it'd be too bad to also have IA start generating sha256
17:50 <yipdw> or sha3 whatever
17:56 -!- chazchaz (~chazchaz@[redacted]) has joined #internetarchive.bak
17:59 <Kenshin> there are a lot of people with old drives though. it does sound possible
18:00 <Kenshin> it's like the discussion we had over twitpic storage space. heh
18:05 <tephra_> yipdw:
18:05 <tephra_> yipdw: i think sha256 or even the combination would be fine enough
18:08 <garyrh_> Finding a collision for 3 or 4 different hash/checksums would be quite a feat.
18:12 <yipdw> tephra_: yeah, probably. I suggested SHA3 because SHA-3 can be faster than SHA-2
18:14 <yipdw> (even faster of course is not calculating anything at all)
18:16 <tephra_> yipdw: oh really, haven't really read up on sha3 I actually thought it was slower
18:17 <yipdw> tephra_: I guess it's kind of a wash, but you can save 2-3 cycles/byte sometimes -> http://bench.cr.yp.to/results-sha3.html
18:17 <yipdw> keccakc512 vs. sha256/512
18:18 <yipdw> anyway, hashing aside
18:18 <yipdw> heh
18:18 <tephra_> heh
18:28 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
18:28 -!- Void_ (~Void@[redacted]) has joined #internetarchive.bak
18:50 -!- Start has quit (Ping timeout: 370 seconds)
19:25 <tephra_> quick and very dirty script that prints the total size of the original files of a collection: https://gist.github.com/EricIO/56ea545df41c303e13cb
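(tephra_'s gist itself is linked above; the core of such a script is presumably something like the following — a reconstruction from the surrounding discussion, not the gist's actual code — summing sizes only for files whose `source` is `original`:)

```python
def original_bytes(file_dicts):
    """Total size of files marked source='original' in an item's metadata."""
    return sum(int(f.get("size", 0))
               for f in file_dicts
               if f.get("source") == "original")


def collection_original_bytes(collection):
    """Network version: walk a collection via the `internetarchive` package
    (deferred import so the pure helper above works without it installed)."""
    from internetarchive import search_items, get_item
    return sum(original_bytes(get_item(r["identifier"]).files)
               for r in search_items("collection:" + collection))
```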
19:40 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
19:41 -!- Start has quit (Read error: Connection reset by peer)
19:41 -!- Start_ (~Start@[redacted]) has joined #internetarchive.bak
20:11 -!- Start_ has quit (Disconnected.)
20:24 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
20:26 -!- Start has quit (Client Quit)
20:28 -!- SadDM (~SadDM@[redacted]) has joined #internetarchive.bak
20:32 -!- bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
20:39 -!- bzc6p has quit (Read error: Operation timed out)
21:19 <SketchCow> Tephra_ can you put the totals in the wiki?
21:19 <SketchCow> or can someone run them?
21:32 <SketchCow> here is the question
21:34 <SketchCow> cryptographic without depending on the archive
21:34 <SketchCow> good or bad
21:36 <yipdw> I'm not sure what that means
21:36 <yipdw> generate hashes without depending on IA?
21:45 <SketchCow> sorry
21:46 <SketchCow> i am in a trick
21:46 <SketchCow> truck
21:47 <SketchCow> so. ideal world, you have the crypto on the drive.
21:47 <SketchCow> maybe a .sh on the drive that when run, unpacks it?
21:49 <garyrh_> you mean like the files are signed with a public key?
21:50 <garyrh_> signing just the metadata might work
21:51 <SketchCow> I am not great at defining solutions.
21:51 <SketchCow> having chunks out there is fine
21:52 <SketchCow> and if we have to restore, encrypted chunks are fine.
21:52 <SketchCow> but I want someone local to encrypt chunks.
21:52 <SketchCow> no IA no central board. the nuclear recovery option
21:53 <tephra_> SketchCow: for the collections already on the wiki you mean? re totals on the wiki
21:54 <SketchCow> tephra. yes please
21:54 <tephra_> SketchCow: right on it
21:54 <SketchCow> thanks
21:54 <SketchCow> full and prig
21:54 <SketchCow> orig
21:54 <tephra_> sure
21:55 <SketchCow> we might have xml and stuff missed but it will still be useful
21:57 <tephra_> so now the script only counts files that have the 'source' label as 'original' which for example for the item https://archive.org/details/Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt
21:57 <tephra_> are Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt.pdf_meta.txt
21:57 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt.pdf
21:57 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt_meta.xml
21:58 <tephra_> Informatica_CPU_Ano_1_No._2_1994-12_Bonus_Rio_Editora_BR_pt_files.xml
21:58 <tephra_> all those are labeled as 'original' in the metadata from the internetarchive python wrapper
21:59 <SketchCow> good
21:59 <SketchCow> agrees
21:59 <SketchCow> agreed
21:59 <tephra_> good
22:00 <SketchCow> go for it.
22:00 <SketchCow> as a bonus at the end, do "movies" ;)
22:02 <tephra_> hehe sure
22:05 <tephra_> do you have a smallish collection with the known total data just to sanity check?
22:07 <SketchCow> choose a magazine
22:07 <tephra_> informaticacpu is ok only four items
22:12 <SketchCow> great
22:19 <tephra_> getting total: 266074360 and original 125517341
22:21 <SketchCow> I'd say spreadsheet it, verify, then do the biggies
22:21 <tephra_> is it possible to see all files for an item on archive.org can't seem to find them
22:48 <tephra_> oh i see it, stupid of me
23:07 <trs80> in terms of bad actors, only allowing users to have one copy of a file will help
23:14 <tephra_> hmm the IA api wrapper doesn't give a size for the _files.xml file in the metadata
23:33 -!- ivan` (~ivan@[redacted]) has joined #internetarchive.bak
23:33 -!- Start (~Start@[redacted]) has joined #internetarchive.bak
23:33 -!- svchfoo2 gives channel operator status to Start