[00:37] *** phuzion has quit IRC (Read error: Operation timed out)
[00:42] *** phuzion has joined #internetarchive
[02:14] *** vitzli has joined #internetarchive
[08:01] *** mistym has joined #internetarchive
[09:10] *** X-Scale has joined #internetarchive
[13:54] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[13:55] *** Lord_Nigh has joined #internetarchive
[14:14] *** vitzli has quit IRC (Quit: Leaving)
[15:51] *** kyounko|2 has quit IRC (Read error: Connection reset by peer)
[15:51] *** kyounko|2 has joined #internetarchive
[15:58] *** atomotic has joined #internetarchive
[16:07] *** Martini has joined #internetarchive
[16:08] <Martini> hi
[16:13] <Martini> I was wondering, since NASA has its own TV Channel. I think that the Internet Archive should have one :)
[16:29] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[16:40] *** atomotic has joined #internetarchive
[16:50] <DFJustin> they don't have the budget for live programming, but a continuous stream of public domain stuff from the collections could be cool
[16:51] <xmc> this is a thing that you can throw together yourself, even :)
[16:52] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[17:17] *** atomotic has joined #internetarchive
[17:31] *** atomotic has quit IRC (Remote host closed the connection)
[18:24] *** tentkls_i has joined #internetarchive
[18:25] <tentkls_i> what is ia policy on scraping
[18:28] <DFJustin> scraping of the IA? they love it I think, try the python library https://internetarchive.readthedocs.io/en/latest/
[18:29] <tentkls_i> thats good to know :D
[18:34] <tentkls_i> are there any major holders of data here? 10T minimum i've looked at the website you all have. seems very disparate
[18:37] <tentkls_i> i have ~50T all deduped unique looking to merge data via externals. github.com/skrp/MKRX
[18:40] <DFJustin> IA has over 14PB http://archiveteam.org/index.php?title=Internet_Archive_Census
[18:51] <tentkls_i> looks like it is scraped metadata. in actuality to scrap IA would take a PB to extract unique data which is what ive been working on
[18:52] <tentkls_i> hashsums does not mean it is uniq; it just means the container is uniq; 
[18:57] <SketchCow> Literally knock yourself out.
[18:59] <tentkls_i> i expected to find ppl here with data 
[18:59] <tentkls_i> i just started to scrape IA last month and brought in 15T; is no one else interested?
[19:00] <tentkls_i> disparate backups is near worthless. to create a seed it has to be efficent, sterilized and centralized
[19:01] <xmc> you might want to check out the ia.bak project?
[19:01] <tentkls_i> thats what i was referring too as being a failed idea
[19:02] <SketchCow> :)
[19:02] <SketchCow> efficent, sterilized and centralized
[19:02] <tentkls_i> well not to be mean, but if you think about it. if you dont have access to the data ... its not a backup
[19:02] <SketchCow> Well, at least the uniforms will look good
[19:02] <xmc> failed idea?
[19:02] <xmc> i don't see you saying that in this channel, was it somewhere else?
[19:03] <tentkls_i> the main limitations are bandwidth and storage
[19:03] <tentkls_i> if you want to combine efforts you focus on combining bandwidth and centralize data via externals
[19:03] <tentkls_i> but you need a centralized dump or its not a backup
[19:04] <xmc> i have a backup of my laptop on an sd card in my wallet. it's offline, thus unavailable, does that mean it's not a backup?
[19:04] <xmc> no, you don't need centralized
[19:04] <xmc> you just need a reasonable guarantee of being able to recall it as needed
[19:04] <tentkls_i> ia.bak gl in getting those all centralized
[19:05] <tentkls_i> im not demeaning the efforts but there is a glaring error
[19:05] <xmc> ia.bak requires hosts to check in monthly
[19:05] <xmc> otherwise they get marked as "untrusted" and the data is given to someone else to also have a copy of
[19:06] <xmc> and the goal is to have two non-IA copies of everything at all times
[19:07] <xmc> or is there something i'm missing?
[19:07] <tentkls_i> ia can have everything because they have PBs; but us we need to scrape intelligently
[19:08] <tentkls_i> they are indisputeably bloated. you can cut it down drastically via intelligent scraping
[19:08] <xmc> i don't understand what you're saying
[19:10] <tentkls_i> bandwidth is the main limitation. storage is the second. multiple nodes deals with bandwidth via smart scraping / externals to send data 
[19:11] <xmc> i don't know what you mean by "smart scraping" and "externals", please be more detailed
[19:12] <tentkls_i> my server is ~200T raw; i mail externals with data; the reciever takes the data then fills it with their data; in a centralized fashion
[19:12] <xmc> what is "externals"
[19:12] <tentkls_i> external hdds
[19:12] <xmc> oh ok
[19:13] <xmc> ia.bak as i understand it is meant to work in the background, not requiring any attention from the participants once it's set up
[19:13] <tentkls_i> ia can afford to have 20 different encodings of a file. but anyone else can't. so you take the encoding you trust as the stable
[19:13] <xmc> sending physical mail is ... not that
[19:14] <xmc> right
[19:14] <xmc> the "source" is USUALLY enough
[19:14] <xmc> there are (rather rare) instances where you need to keep the derived data
[19:15] <tentkls_i> source?
[19:15] <xmc> "source" being whatever was uploaded originally, "derived" are the different encodings that IA created
[19:17] <tentkls_i> xmc let me give you an example
[19:17] <xmc> ok
[19:17] <tentkls_i> the pdf if 20M; the compressed 'all' version is 700M https://archive.org/download/theoryofliteratu00inwell
[19:17] <xmc> the pdf isn't the source though
[19:18] <xmc> the source data, from the scan, that you REALLY want to keep, is the _orig_jp2.tar file, which is 413 megs
[19:19] <xmc> you can make the pdf again from the scans, but you can't make the scans from the pdf
[19:19] <tentkls_i> yes but in bandwidth limited env. you take what you need
[19:20] <xmc> that's why you do it ahead of time, so it's not an emergency and you can get the good stuff
[19:20] <tentkls_i> pdf is the most likely to be verified. i choose to dl 20 pdfs over one src
[19:20] <xmc> verified?
[19:20] <tentkls_i> meaning someone has likely viewed
[19:21] <xmc> i don't understand, use more words please?
[19:21] <tentkls_i> encodings fk up alot; so you have to go with the one that is likely to be the most widely used
[19:21] <tentkls_i> on the hope, assumption, that any kinks have been ironed out
[19:22] <xmc> pdf is made from source images, which are created by hand and verified by the person operating the scanner
[19:23] <tentkls_i> do you get what i mean by 'intelligent scraping' ? tho you might disagree that is the right adjective
[19:24] <xmc> i would say pragmatic & opinionated, but i understand
[19:24] <tentkls_i> well its like a curator of a museum
[19:24] <xmc> yes
[19:24] <tentkls_i> you have to have your own style or you are just a packrat :D
[19:24] <xmc> i suppose
[19:25] <tentkls_i> getting source is the best you are right, and you made me understand that; but i believe it is not worth 20xs the bandwidth
[19:26] <tentkls_i> i take a collection, investigate what makes it common and make a decision; then i scrape. that is my methode
[19:27] <tentkls_i> i can host all your data; i can front the exernals; and you can request all data; such transaction to scrapers can not take place out of bandwidth
[19:28] <tentkls_i> i can mail a 8T full; that would takes months for the normal person in bandwidth; tho only 3 days shipping and 1 day xfer
[19:35] <tentkls_i> any veteran scraper knows you don't automate scraping; if you 'backup data' has a worse connection than the source then...
[19:36] <xmc> so i'm not sure you understand ia.bak at all
[19:36] <tentkls_i> ive read into it several times; maybe it dont
[19:36] <xmc> but it seems like you are set in your ways and not interested anyway
[19:36] <xmc> anyway, don't stop!
[19:36] <tentkls_i> i dont expect you to agree; just thought i give u a diff perspective
[19:37] <xmc> kool thx
[19:42] *** Asparagir has joined #internetarchive
[19:49] *** tentkls_i has left Leaving
[20:07] *** kyan has joined #internetarchive
[22:40] *** Martini has quit IRC (Ping timeout: 255 seconds)
[23:59] *** X-Scale has quit IRC (Ping timeout: 240 seconds)