[00:37] *** phuzion has quit IRC (Read error: Operation timed out)
[00:42] *** phuzion has joined #internetarchive
[02:14] *** vitzli has joined #internetarchive
[08:01] *** mistym has joined #internetarchive
[09:10] *** X-Scale has joined #internetarchive
[13:54] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[13:55] *** Lord_Nigh has joined #internetarchive
[14:14] *** vitzli has quit IRC (Quit: Leaving)
[15:51] *** kyounko|2 has quit IRC (Read error: Connection reset by peer)
[15:51] *** kyounko|2 has joined #internetarchive
[15:58] *** atomotic has joined #internetarchive
[16:07] *** Martini has joined #internetarchive
[16:08] hi
[16:13] I was wondering: since NASA has its own TV channel, I think the Internet Archive should have one :)
[16:29] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[16:40] *** atomotic has joined #internetarchive
[16:50] they don't have the budget for live programming, but a continuous stream of public domain stuff from the collections could be cool
[16:51] this is a thing that you can throw together yourself, even :)
[16:52] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[17:17] *** atomotic has joined #internetarchive
[17:31] *** atomotic has quit IRC (Remote host closed the connection)
[18:24] *** tentkls_i has joined #internetarchive
[18:25] what is IA's policy on scraping?
[18:28] scraping of the IA? they love it, I think; try the python library https://internetarchive.readthedocs.io/en/latest/
[18:29] that's good to know :D
[18:34] are there any major holders of data here? 10T minimum. I've looked at the website you all have; seems very disparate
[18:37] I have ~50T, all deduped and unique, looking to merge data via externals. github.com/skrp/MKRX
[18:40] IA has over 14PB http://archiveteam.org/index.php?title=Internet_Archive_Census
[18:51] looks like that is scraped metadata. In actuality, scraping IA would take a PB to extract the unique data, which is what I've been working on
[18:52] hashsums don't mean the content is unique; they just mean the container is unique
[18:57] Literally knock yourself out.
[18:59] I expected to find people here with data
[18:59] I just started to scrape IA last month and brought in 15T; is no one else interested?
[19:00] disparate backups are near worthless. To create a seed it has to be efficient, sterilized and centralized
[19:01] you might want to check out the ia.bak project?
[19:01] that's what I was referring to as being a failed idea
[19:02] :)
[19:02] efficient, sterilized and centralized
[19:02] well, not to be mean, but if you think about it: if you don't have access to the data ... it's not a backup
[19:02] Well, at least the uniforms will look good
[19:02] failed idea?
[19:02] I don't see you saying that in this channel, was it somewhere else?
[19:03] the main limitations are bandwidth and storage
[19:03] if you want to combine efforts, you focus on combining bandwidth and centralize data via externals
[19:03] but you need a centralized dump or it's not a backup
[19:04] I have a backup of my laptop on an sd card in my wallet. it's offline, thus unavailable; does that mean it's not a backup?
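(A minimal sketch of the python library recommended at [18:28] for scraping IA. The collection name and destination directory below are illustrative placeholders, not anything named in the channel.)

    # Sketch: mirror an Internet Archive collection with the `internetarchive`
    # Python library from https://internetarchive.readthedocs.io/en/latest/
    # 'prelinger' and 'mirror/' are example values, not part of the conversation.
    from internetarchive import search_items, download

    for result in search_items('collection:prelinger'):
        identifier = result['identifier']
        # Fetches every file in the item unless you filter it down,
        # which is where the bandwidth goes.
        download(identifier, destdir='mirror', verbose=True)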
[19:04] no, you don't need centralized
[19:04] you just need a reasonable guarantee of being able to recall it as needed
[19:04] ia.bak? good luck getting those all centralized
[19:05] I'm not demeaning the efforts, but there is a glaring error
[19:05] ia.bak requires hosts to check in monthly
[19:05] otherwise they get marked as "untrusted" and the data is given to someone else to also have a copy of
[19:06] and the goal is to have two non-IA copies of everything at all times
[19:07] or is there something I'm missing?
[19:07] IA can have everything because they have PBs; but we need to scrape intelligently
[19:08] they are indisputably bloated. you can cut it down drastically via intelligent scraping
[19:08] I don't understand what you're saying
[19:10] bandwidth is the main limitation, storage is the second. multiple nodes deal with bandwidth via smart scraping / externals to send data
[19:11] I don't know what you mean by "smart scraping" and "externals", please be more detailed
[19:12] my server is ~200T raw; I mail externals with data; the receiver takes the data, then fills it with their data, in a centralized fashion
[19:12] what is "externals"
[19:12] external hdds
[19:12] oh ok
[19:13] ia.bak, as I understand it, is meant to work in the background, not requiring any attention from the participants once it's set up
[19:13] IA can afford to have 20 different encodings of a file, but anyone else can't. so you take the encoding you trust as the stable one
[19:13] sending physical mail is ... not that
[19:14] right
[19:14] the "source" is USUALLY enough
[19:14] there are (rather rare) instances where you need to keep the derived data
[19:15] source?
[19:15] "source" being whatever was uploaded originally; "derived" are the different encodings that IA created
[19:17] xmc, let me give you an example
[19:17] ok
[19:17] the pdf is 20M; the compressed 'all' version is 700M https://archive.org/download/theoryofliteratu00inwell
[19:17] the pdf isn't the source though
[19:18] the source data, from the scan, that you REALLY want to keep, is the _orig_jp2.tar file, which is 413 megs
[19:19] you can make the pdf again from the scans, but you can't make the scans from the pdf
[19:19] yes, but in a bandwidth-limited environment you take what you need
[19:20] that's why you do it ahead of time, so it's not an emergency and you can get the good stuff
[19:20] the pdf is the most likely to be verified. I choose to dl 20 pdfs over one src
[19:20] verified?
[19:20] meaning someone has likely viewed it
[19:21] I don't understand, use more words please?
[19:21] encodings fk up a lot, so you have to go with the one that is likely to be the most widely used
[19:21] on the hope, the assumption, that any kinks have been ironed out
[19:22] the pdf is made from source images, which are created by hand and verified by the person operating the scanner
[19:23] do you get what I mean by 'intelligent scraping'? though you might disagree that's the right adjective
[19:24] I would say pragmatic & opinionated, but I understand
[19:24] well, it's like being the curator of a museum
[19:24] yes
[19:24] you have to have your own style or you are just a packrat :D
[19:24] I suppose
[19:25] getting source is the best, you are right, and you made me understand that; but I believe it is not worth 20x the bandwidth
[19:26] I take a collection, investigate what makes it common, and make a decision; then I scrape. That is my method.
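(A sketch of the format-picking idea discussed from [19:13] to [19:26], again assuming the `internetarchive` Python library; file and format names differ between items, so the glob patterns below are examples rather than a fixed recipe.)

    from internetarchive import get_item, download

    item = get_item('theoryofliteratu00inwell')

    # List each file with its declared format and size before spending
    # bandwidth on it.
    for f in item.files:
        print(f['name'], f.get('format'), f.get('size'))

    # Fetch only the derived PDF (the ~20M file from [19:17]) ...
    download('theoryofliteratu00inwell', glob_pattern='*.pdf')

    # ... or only the original scan images (the ~413M _orig_jp2.tar from [19:18]).
    download('theoryofliteratu00inwell', glob_pattern='*_orig_jp2.tar')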
[19:27] I can host all your data; I can front the externals; and you can request all data; such a transfer between scrapers can't happen over bandwidth alone
[19:28] I can mail an 8T drive full; that would take months for the normal person over bandwidth, though only 3 days shipping and 1 day transfer
[19:35] any veteran scraper knows you don't fully automate scraping; if your 'backup data' has a worse connection than the source, then...
[19:36] so I'm not sure you understand ia.bak at all
[19:36] I've read into it several times; maybe I don't
[19:36] but it seems like you are set in your ways and not interested anyway
[19:36] anyway, don't stop!
[19:36] I don't expect you to agree; just thought I'd give you a different perspective
[19:37] kool, thx
[19:42] *** Asparagir has joined #internetarchive
[19:49] *** tentkls_i has left (Leaving)
[20:07] *** kyan has joined #internetarchive
[22:40] *** Martini has quit IRC (Ping timeout: 255 seconds)
[23:59] *** X-Scale has quit IRC (Ping timeout: 240 seconds)
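(A back-of-the-envelope check of the bandwidth-versus-shipping claim at [19:28]; the link speeds here are assumptions, not figures given in the conversation.)

    # How long does the "8T full" drive from [19:28] take to move over
    # typical residential links?
    TB = 10**12  # bytes
    payload = 8 * TB

    for mbit_per_s in (10, 50, 100):
        seconds = payload * 8 / (mbit_per_s * 10**6)
        print(f"{mbit_per_s:>3} Mbit/s -> {seconds / 86400:.1f} days")

    # ~74 days at 10 Mbit/s (months, as claimed) and ~7.4 days at 100 Mbit/s,
    # versus roughly 3 days of shipping plus 1 day of local transfer.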