#internetarchive 2017-01-03,Tue


Time	Nickname	Message
00:37 ^🔗		phuzion has quit IRC (Read error: Operation timed out)
00:42 ^🔗		phuzion has joined #internetarchive
02:14 ^🔗		vitzli has joined #internetarchive
08:01 ^🔗		mistym has joined #internetarchive
09:10 ^🔗		X-Scale has joined #internetarchive
13:54 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
13:55 ^🔗		Lord_Nigh has joined #internetarchive
14:14 ^🔗		vitzli has quit IRC (Quit: Leaving)
15:51 ^🔗		kyounko|2 has quit IRC (Read error: Connection reset by peer)
15:51 ^🔗		kyounko|2 has joined #internetarchive
15:58 ^🔗		atomotic has joined #internetarchive
16:07 ^🔗		Martini has joined #internetarchive
16:08 ^🔗	Martini	hi
16:13 ^🔗	Martini	I was wondering, since NASA has its own TV Channel. I think that the Internet Archive should have one :)
16:29 ^🔗		atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
16:40 ^🔗		atomotic has joined #internetarchive
16:50 ^🔗	DFJustin	they don't have the budget for live programming, but a continuous stream of public domain stuff from the collections could be cool
16:51 ^🔗	xmc	this is a thing that you can throw together yourself, even :)
16:52 ^🔗		atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
17:17 ^🔗		atomotic has joined #internetarchive
17:31 ^🔗		atomotic has quit IRC (Remote host closed the connection)
18:24 ^🔗		tentkls_i has joined #internetarchive
18:25 ^🔗	tentkls_i	what is ia policy on scraping
18:28 ^🔗	DFJustin	scraping of the IA? they love it I think, try the python library https://internetarchive.readthedocs.io/en/latest/
18:29 ^🔗	tentkls_i	thats good to know :D
18:34 ^🔗	tentkls_i	are there any major holders of data here? 10T minimum i've looked at the website you all have. seems very disparate
18:37 ^🔗	tentkls_i	i have ~50T all deduped unique looking to merge data via externals. github.com/skrp/MKRX
18:40 ^🔗	DFJustin	IA has over 14PB http://archiveteam.org/index.php?title=Internet_Archive_Census
18:51 ^🔗	tentkls_i	looks like it is scraped metadata. in actuality to scrape IA would take a PB to extract unique data which is what ive been working on
18:52 ^🔗	tentkls_i	hashsums does not mean it is uniq; it just means the container is uniq;
18:57 ^🔗	SketchCow	Literally knock yourself out.
18:59 ^🔗	tentkls_i	i expected to find ppl here with data
18:59 ^🔗	tentkls_i	i just started to scrape IA last month and brought in 15T; is no one else interested?
19:00 ^🔗	tentkls_i	disparate backups are near worthless. to create a seed it has to be efficient, sterilized and centralized
19:01 ^🔗	xmc	you might want to check out the ia.bak project?
19:01 ^🔗	tentkls_i	thats what i was referring to as being a failed idea
19:02 ^🔗	SketchCow	:)
19:02 ^🔗	SketchCow	efficient, sterilized and centralized
19:02 ^🔗	tentkls_i	well not to be mean, but if you think about it. if you dont have access to the data ... its not a backup
19:02 ^🔗	SketchCow	Well, at least the uniforms will look good
19:02 ^🔗	xmc	failed idea?
19:02 ^🔗	xmc	i don't see you saying that in this channel, was it somewhere else?
19:03 ^🔗	tentkls_i	the main limitations are bandwidth and storage
19:03 ^🔗	tentkls_i	if you want to combine efforts you focus on combining bandwidth and centralize data via externals
19:03 ^🔗	tentkls_i	but you need a centralized dump or its not a backup
19:04 ^🔗	xmc	i have a backup of my laptop on an sd card in my wallet. it's offline, thus unavailable, does that mean it's not a backup?
19:04 ^🔗	xmc	no, you don't need centralized
19:04 ^🔗	xmc	you just need a reasonable guarantee of being able to recall it as needed
19:04 ^🔗	tentkls_i	ia.bak gl in getting those all centralized
19:05 ^🔗	tentkls_i	im not demeaning the efforts but there is a glaring error
19:05 ^🔗	xmc	ia.bak requires hosts to check in monthly
19:05 ^🔗	xmc	otherwise they get marked as "untrusted" and the data is given to someone else to also have a copy of
19:06 ^🔗	xmc	and the goal is to have two non-IA copies of everything at all times
19:07 ^🔗	xmc	or is there something i'm missing?
19:07 ^🔗	tentkls_i	ia can have everything because they have PBs; but us we need to scrape intelligently
19:08 ^🔗	tentkls_i	they are indisputably bloated. you can cut it down drastically via intelligent scraping
19:08 ^🔗	xmc	i don't understand what you're saying
19:10 ^🔗	tentkls_i	bandwidth is the main limitation. storage is the second. multiple nodes deals with bandwidth via smart scraping / externals to send data
19:11 ^🔗	xmc	i don't know what you mean by "smart scraping" and "externals", please be more detailed
19:12 ^🔗	tentkls_i	my server is ~200T raw; i mail externals with data; the receiver takes the data then fills it with their data; in a centralized fashion
19:12 ^🔗	xmc	what is "externals"
19:12 ^🔗	tentkls_i	external hdds
19:12 ^🔗	xmc	oh ok
19:13 ^🔗	xmc	ia.bak as i understand it is meant to work in the background, not requiring any attention from the participants once it's set up
19:13 ^🔗	tentkls_i	ia can afford to have 20 different encodings of a file. but anyone else can't. so you take the encoding you trust as the stable
19:13 ^🔗	xmc	sending physical mail is ... not that
19:14 ^🔗	xmc	right
19:14 ^🔗	xmc	the "source" is USUALLY enough
19:14 ^🔗	xmc	there are (rather rare) instances where you need to keep the derived data
19:15 ^🔗	tentkls_i	source?
19:15 ^🔗	xmc	"source" being whatever was uploaded originally, "derived" are the different encodings that IA created
19:17 ^🔗	tentkls_i	xmc let me give you an example
19:17 ^🔗	xmc	ok
19:17 ^🔗	tentkls_i	the pdf is 20M; the compressed 'all' version is 700M https://archive.org/download/theoryofliteratu00inwell
19:17 ^🔗	xmc	the pdf isn't the source though
19:18 ^🔗	xmc	the source data, from the scan, that you REALLY want to keep, is the _orig_jp2.tar file, which is 413 megs
19:19 ^🔗	xmc	you can make the pdf again from the scans, but you can't make the scans from the pdf
19:19 ^🔗	tentkls_i	yes but in bandwidth limited env. you take what you need
19:20 ^🔗	xmc	that's why you do it ahead of time, so it's not an emergency and you can get the good stuff
19:20 ^🔗	tentkls_i	pdf is the most likely to be verified. i choose to dl 20 pdfs over one src
19:20 ^🔗	xmc	verified?
19:20 ^🔗	tentkls_i	meaning someone has likely viewed
19:21 ^🔗	xmc	i don't understand, use more words please?
19:21 ^🔗	tentkls_i	encodings fk up a lot; so you have to go with the one that is likely to be the most widely used
19:21 ^🔗	tentkls_i	on the hope, assumption, that any kinks have been ironed out
19:22 ^🔗	xmc	pdf is made from source images, which are created by hand and verified by the person operating the scanner
19:23 ^🔗	tentkls_i	do you get what i mean by 'intelligent scraping' ? tho you might disagree that is the right adjective
19:24 ^🔗	xmc	i would say pragmatic & opinionated, but i understand
19:24 ^🔗	tentkls_i	well its like a curator of a museum
19:24 ^🔗	xmc	yes
19:24 ^🔗	tentkls_i	you have to have your own style or you are just a packrat :D
19:24 ^🔗	xmc	i suppose
19:25 ^🔗	tentkls_i	getting source is the best you are right, and you made me understand that; but i believe it is not worth 20xs the bandwidth
19:26 ^🔗	tentkls_i	i take a collection, investigate what makes it common and make a decision; then i scrape. that is my method
19:27 ^🔗	tentkls_i	i can host all your data; i can front the externals; and you can request all data; such transaction to scrapers can not take place out of bandwidth
19:28 ^🔗	tentkls_i	i can mail a 8T full; that would take months for the normal person in bandwidth; tho only 3 days shipping and 1 day xfer
19:35 ^🔗	tentkls_i	any veteran scraper knows you don't automate scraping; if your 'backup data' has a worse connection than the source then...
19:36 ^🔗	xmc	so i'm not sure you understand ia.bak at all
19:36 ^🔗	tentkls_i	ive read into it several times; maybe i dont
19:36 ^🔗	xmc	but it seems like you are set in your ways and not interested anyway
19:36 ^🔗	xmc	anyway, don't stop!
19:36 ^🔗	tentkls_i	i dont expect you to agree; just thought i'd give u a diff perspective
19:37 ^🔗	xmc	kool thx
19:42 ^🔗		Asparagir has joined #internetarchive
19:49 ^🔗		tentkls_i has left Leaving
20:07 ^🔗		kyan has joined #internetarchive
22:40 ^🔗		Martini has quit IRC (Ping timeout: 255 seconds)
23:59 ^🔗		X-Scale has quit IRC (Ping timeout: 240 seconds)
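
The shipping-versus-download comparison in the 19:28 message can be checked with a little arithmetic. The sketch below is not from the log. It assumes "8T" means 8 decimal terabytes, that the link runs at a sustained rate with no protocol overhead, and it picks illustrative link speeds of 10, 100 and 1000 Mbit/s. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link at a sustained bit rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


def break_even_mbit(size_bytes: float, mail_days: float) -> float:
    """Sustained link speed (Mbit/s) at which downloading takes as long as mailing."""
    bits_per_second = size_bytes * 8 / (mail_days * SECONDS_PER_DAY)
    return bits_per_second / 10**6


# Assumption: "8T" in the log is read as 8 decimal terabytes.
DRIVE_BYTES = 8 * 10**12
# From the log: 3 days shipping plus 1 day to copy the drive.
MAIL_DAYS = 3 + 1

if __name__ == "__main__":
    # Assumed link speeds, chosen only for illustration.
    for mbit in (10, 100, 1000):
        days = transfer_days(DRIVE_BYTES, mbit * 10**6)
        print(f"{mbit:>5} Mbit/s: {days:.1f} days by network, {MAIL_DAYS} days by mail")
    print(f"break-even: {break_even_mbit(DRIVE_BYTES, MAIL_DAYS):.0f} Mbit/s")
```

Under these assumptions a 10 Mbit/s link needs about 74 days for the drive, which matches the "months" estimate, and the mailed drive stays ahead on any link slower than roughly 185 Mbit/s sustained.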
