#internetarchive 2017-01-03,Tue

Time Nickname Message
00:37 🔗 phuzion has quit IRC (Read error: Operation timed out)
00:42 🔗 phuzion has joined #internetarchive
02:14 🔗 vitzli has joined #internetarchive
08:01 🔗 mistym has joined #internetarchive
09:10 🔗 X-Scale has joined #internetarchive
13:54 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
13:55 🔗 Lord_Nigh has joined #internetarchive
14:14 🔗 vitzli has quit IRC (Quit: Leaving)
15:51 🔗 kyounko|2 has quit IRC (Read error: Connection reset by peer)
15:51 🔗 kyounko|2 has joined #internetarchive
15:58 🔗 atomotic has joined #internetarchive
16:07 🔗 Martini has joined #internetarchive
16:08 🔗 Martini hi
16:13 🔗 Martini I was wondering: since NASA has its own TV channel, I think the Internet Archive should have one :)
16:29 🔗 atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
16:40 🔗 atomotic has joined #internetarchive
16:50 🔗 DFJustin they don't have the budget for live programming, but a continuous stream of public domain stuff from the collections could be cool
16:51 🔗 xmc this is a thing that you can throw together yourself, even :)
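A minimal sketch of what "throw together yourself" could look like, assuming the internetarchive Python package and a local mpv player; the prelinger collection is just an example query:

```python
import random
import subprocess

from internetarchive import get_item, search_items

# Gather identifiers for public-domain films (example: the Prelinger collection).
ids = [r['identifier'] for r in search_items('collection:prelinger AND mediatype:movies')]

while True:
    item = get_item(random.choice(ids))
    mp4s = [f['name'] for f in item.files if f['name'].endswith('.mp4')]
    if not mp4s:
        continue  # some items carry no mp4 derivative; skip them
    url = f"https://archive.org/download/{item.identifier}/{mp4s[0]}"
    subprocess.run(['mpv', '--fs', url])  # play picks back-to-back, channel-style
```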
16:52 🔗 atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
17:17 🔗 atomotic has joined #internetarchive
17:31 🔗 atomotic has quit IRC (Remote host closed the connection)
18:24 🔗 tentkls_i has joined #internetarchive
18:25 🔗 tentkls_i what is ia's policy on scraping?
18:28 🔗 DFJustin scraping of the IA? they love it I think, try the python library https://internetarchive.readthedocs.io/en/latest/
18:29 🔗 tentkls_i thats good to know :D
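The library DFJustin links makes that straightforward; a hedged sketch using its documented search_items and download calls (the collection and glob pattern here are placeholders):

```python
from internetarchive import download, search_items

# Walk a collection's search results and mirror a chosen file type locally.
for result in search_items('collection:librivoxaudio'):  # placeholder query
    download(result['identifier'],
             glob_pattern='*.mp3',   # fetch only one format, not every derivative
             destdir='ia_mirror',
             verbose=True)
```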
18:34 🔗 tentkls_i are there any major holders of data here? 10T minimum. i've looked at the website you all have; it seems very disparate
18:37 🔗 tentkls_i i have ~50T, all deduped and unique, looking to merge data via externals. github.com/skrp/MKRX
18:40 🔗 DFJustin IA has over 14PB http://archiveteam.org/index.php?title=Internet_Archive_Census
18:51 🔗 tentkls_i looks like that is scraped metadata. in actuality, scraping IA would take a PB to extract the unique data, which is what i've been working on
18:52 🔗 tentkls_i a hashsum doesn't mean the data is unique; it just means the container is unique
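The container-vs-content distinction is easy to demonstrate; a sketch using gzip headers as the "container" (requires Python 3.8+ for the mtime keyword):

```python
import gzip
import hashlib

payload = b'the same underlying data'

# Two containers for the same payload, differing only in a gzip header timestamp.
a = gzip.compress(payload, mtime=0)
b = gzip.compress(payload, mtime=1)

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())
# False: the containers hash differently
print(hashlib.sha256(gzip.decompress(a)).hexdigest()
      == hashlib.sha256(gzip.decompress(b)).hexdigest())
# True: the content inside is identical
```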
18:57 🔗 SketchCow Literally knock yourself out.
18:59 🔗 tentkls_i i expected to find ppl here with data
18:59 🔗 tentkls_i i just started to scrape IA last month and brought in 15T; is no one else interested?
19:00 🔗 tentkls_i disparate backups are near worthless. to create a seed it has to be efficient, sterilized and centralized
19:01 🔗 xmc you might want to check out the ia.bak project?
19:01 🔗 tentkls_i that's what i was referring to as being a failed idea
19:02 🔗 SketchCow :)
19:02 🔗 SketchCow efficient, sterilized and centralized
19:02 🔗 tentkls_i well, not to be mean, but if you think about it: if you don't have access to the data ... it's not a backup
19:02 🔗 SketchCow Well, at least the uniforms will look good
19:02 🔗 xmc failed idea?
19:02 🔗 xmc i don't see you saying that in this channel, was it somewhere else?
19:03 🔗 tentkls_i the main limitations are bandwidth and storage
19:03 🔗 tentkls_i if you want to combine efforts, you focus on combining bandwidth and centralizing data via externals
19:03 🔗 tentkls_i but you need a centralized dump or its not a backup
19:04 🔗 xmc i have a backup of my laptop on an sd card in my wallet. it's offline, thus unavailable, does that mean it's not a backup?
19:04 🔗 xmc no, you don't need centralized
19:04 🔗 xmc you just need a reasonable guarantee of being able to recall it as needed
19:04 🔗 tentkls_i ia.bak? gl getting those all centralized
19:05 🔗 tentkls_i i'm not demeaning the efforts, but there is a glaring error
19:05 🔗 xmc ia.bak requires hosts to check in monthly
19:05 🔗 xmc otherwise they get marked as "untrusted" and the data is given to someone else to also have a copy of
19:06 🔗 xmc and the goal is to have two non-IA copies of everything at all times
19:07 🔗 xmc or is there something i'm missing?
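The check-in rule xmc describes amounts to something like this (a sketch of the policy only, not ia.bak's actual git-annex-based implementation; the names here are made up):

```python
from datetime import datetime, timedelta

CHECKIN_WINDOW = timedelta(days=30)  # hosts must check in monthly

def untrusted_hosts(last_checkin, now):
    """Hosts silent past the window get marked untrusted, and their shards
    are handed to another host so two non-IA copies survive."""
    return {host for host, seen in last_checkin.items()
            if now - seen > CHECKIN_WINDOW}

# Example: 'b' last checked in 40 days ago, so its data would be reassigned.
now = datetime(2017, 1, 3)
print(untrusted_hosts({'a': now - timedelta(days=5),
                       'b': now - timedelta(days=40)}, now))  # {'b'}
```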
19:07 🔗 tentkls_i ia can have everything because they have PBs; but the rest of us need to scrape intelligently
19:08 🔗 tentkls_i they are indisputably bloated. you can cut it down drastically via intelligent scraping
19:08 🔗 xmc i don't understand what you're saying
19:10 🔗 tentkls_i bandwidth is the main limitation; storage is the second. multiple nodes deal with bandwidth via smart scraping / externals to send data
19:11 🔗 xmc i don't know what you mean by "smart scraping" and "externals", please be more detailed
19:12 🔗 tentkls_i my server is ~200T raw; i mail externals with data; the receiver takes the data, then fills it with their data, in a centralized fashion
19:12 🔗 xmc what is "externals"
19:12 🔗 tentkls_i external hdds
19:12 🔗 xmc oh ok
19:13 🔗 xmc ia.bak as i understand it is meant to work in the background, not requiring any attention from the participants once it's set up
19:13 🔗 tentkls_i ia can afford to have 20 different encodings of a file, but anyone else can't. so you take the encoding you trust as the stable one
19:13 🔗 xmc sending physical mail is ... not that
19:14 🔗 xmc right
19:14 🔗 xmc the "source" is USUALLY enough
19:14 🔗 xmc there are (rather rare) instances where you need to keep the derived data
19:15 🔗 tentkls_i source?
19:15 🔗 xmc "source" being whatever was uploaded originally, "derived" are the different encodings that IA created
19:17 🔗 tentkls_i xmc let me give you an example
19:17 🔗 xmc ok
19:17 🔗 tentkls_i the pdf is 20M; the compressed 'all' version is 700M https://archive.org/download/theoryofliteratu00inwell
19:17 🔗 xmc the pdf isn't the source though
19:18 🔗 xmc the source data, from the scan, that you REALLY want to keep, is the _orig_jp2.tar file, which is 413 megs
19:19 🔗 xmc you can make the pdf again from the scans, but you can't make the scans from the pdf
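IA's file-level metadata records exactly this distinction, so a scraper can keep originals and skip derivatives; a sketch against the item linked above, assuming the internetarchive package (each file entry carries a 'source' field of 'original' or 'derivative'):

```python
from internetarchive import get_item

item = get_item('theoryofliteratu00inwell')  # the book linked above

# Keep only what was uploaded or scanned, not the encodings IA generated.
originals = [f['name'] for f in item.files if f.get('source') == 'original']
print(originals)  # includes the _orig_jp2.tar scan data xmc points to
```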
19:19 🔗 tentkls_i yes but in bandwidth limited env. you take what you need
19:20 🔗 xmc that's why you do it ahead of time, so it's not an emergency and you can get the good stuff
19:20 🔗 tentkls_i pdf is the most likely to be verified. i choose to dl 20 pdfs over one src
19:20 🔗 xmc verified?
19:20 🔗 tentkls_i meaning someone has likely viewed it
19:21 🔗 xmc i don't understand, use more words please?
19:21 🔗 tentkls_i encodings fk up a lot; so you have to go with the one that is likely the most widely used
19:21 🔗 tentkls_i on the hope, or assumption, that any kinks have been ironed out
19:22 🔗 xmc pdf is made from source images, which are created by hand and verified by the person operating the scanner
19:23 🔗 tentkls_i do you get what i mean by 'intelligent scraping'? tho you might disagree that is the right adjective
19:24 🔗 xmc i would say pragmatic & opinionated, but i understand
19:24 🔗 tentkls_i well its like a curator of a museum
19:24 🔗 xmc yes
19:24 🔗 tentkls_i you have to have your own style or you are just a packrat :D
19:24 🔗 xmc i suppose
19:25 🔗 tentkls_i getting source is the best, you are right, and you made me understand that; but i believe it is not worth 20x the bandwidth
19:26 🔗 tentkls_i i take a collection, investigate what makes it common, and make a decision; then i scrape. that is my method
19:27 🔗 tentkls_i i can host all your data; i can front the externals; and you can request all data; such transactions to scrapers can not take place over bandwidth
19:28 🔗 tentkls_i i can mail an 8T drive full; that would take months for the normal person over bandwidth, tho only 3 days shipping and 1 day xfer
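The arithmetic behind mailing drives, under assumed link speeds (a 10 Mbit/s home connection versus a ~100 MB/s local copy):

```python
TB = 10**12  # decimal terabytes, as drives are sold

def transfer_days(size_bytes, bits_per_sec):
    """Days needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bits_per_sec / 86_400

drive = 8 * TB
print(f"10 Mbit/s home link: {transfer_days(drive, 10e6):.0f} days")    # ~74 days
print(f"~100 MB/s local copy: {transfer_days(drive, 800e6):.1f} days")  # ~0.9 days
# versus roughly 3 days shipping plus 1 day transfer for the mailed drive
```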
19:35 🔗 tentkls_i any veteran scraper knows you don't automate scraping; if your 'backup data' has a worse connection than the source then...
19:36 🔗 xmc so i'm not sure you understand ia.bak at all
19:36 🔗 tentkls_i i've read into it several times; maybe i don't
19:36 🔗 xmc but it seems like you are set in your ways and not interested anyway
19:36 🔗 xmc anyway, don't stop!
19:36 🔗 tentkls_i i don't expect you to agree; just thought i'd give u a diff perspective
19:37 🔗 xmc kool thx
19:42 🔗 Asparagir has joined #internetarchive
19:49 🔗 tentkls_i has left Leaving
20:07 🔗 kyan has joined #internetarchive
22:40 🔗 Martini has quit IRC (Ping timeout: 255 seconds)
23:59 🔗 X-Scale has quit IRC (Ping timeout: 240 seconds)
