Time |
Nickname |
Message |
00:37
|
|
phuzion has quit IRC (Read error: Operation timed out) |
00:42
|
|
phuzion has joined #internetarchive |
02:14
|
|
vitzli has joined #internetarchive |
08:01
|
|
mistym has joined #internetarchive |
09:10
|
|
X-Scale has joined #internetarchive |
13:54
|
|
Lord_Nigh has quit IRC (Ping timeout: 244 seconds) |
13:55
|
|
Lord_Nigh has joined #internetarchive |
14:14
|
|
vitzli has quit IRC (Quit: Leaving) |
15:51
|
|
kyounko|2 has quit IRC (Read error: Connection reset by peer) |
15:51
|
|
kyounko|2 has joined #internetarchive |
15:58
|
|
atomotic has joined #internetarchive |
16:07
|
|
Martini has joined #internetarchive |
16:08
|
Martini |
hi |
16:13
|
Martini |
I was wondering, since NASA has its own TV Channel. I think that the Internet Archive should have one :) |
16:29
|
|
atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
16:40
|
|
atomotic has joined #internetarchive |
16:50
|
DFJustin |
they don't have the budget for live programming, but a continuous stream of public domain stuff from the collections could be cool |
16:51
|
xmc |
this is a thing that you can throw together yourself, even :) |
16:52
|
|
atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
17:17
|
|
atomotic has joined #internetarchive |
17:31
|
|
atomotic has quit IRC (Remote host closed the connection) |
18:24
|
|
tentkls_i has joined #internetarchive |
18:25
|
tentkls_i |
what is ia policy on scraping |
18:28
|
DFJustin |
scraping of the IA? they love it I think, try the python library https://internetarchive.readthedocs.io/en/latest/ |
18:29
|
tentkls_i |
thats good to know :D |
18:34
|
tentkls_i |
are there any major holders of data here? 10T minimum. i've looked at the website you all have. seems very disparate |
18:37
|
tentkls_i |
i have ~50T all deduped unique looking to merge data via externals. github.com/skrp/MKRX |
18:40
|
DFJustin |
IA has over 14PB http://archiveteam.org/index.php?title=Internet_Archive_Census |
18:51
|
tentkls_i |
looks like it is scraped metadata. in actuality to scrape IA would take a PB to extract unique data which is what ive been working on |
18:52
|
tentkls_i |
hashsums does not mean it is uniq; it just means the container is uniq; |
18:57
|
SketchCow |
Literally knock yourself out. |
18:59
|
tentkls_i |
i expected to find ppl here with data |
18:59
|
tentkls_i |
i just started to scrape IA last month and brought in 15T; is no one else interested? |
19:00
|
tentkls_i |
disparate backups are near worthless. to create a seed it has to be efficient, sterilized and centralized |
19:01
|
xmc |
you might want to check out the ia.bak project? |
19:01
|
tentkls_i |
thats what i was referring to as being a failed idea |
19:02
|
SketchCow |
:) |
19:02
|
SketchCow |
efficient, sterilized and centralized |
19:02
|
tentkls_i |
well not to be mean, but if you think about it. if you dont have access to the data ... its not a backup |
19:02
|
SketchCow |
Well, at least the uniforms will look good |
19:02
|
xmc |
failed idea? |
19:02
|
xmc |
i don't see you saying that in this channel, was it somewhere else? |
19:03
|
tentkls_i |
the main limitations are bandwidth and storage |
19:03
|
tentkls_i |
if you want to combine efforts you focus on combining bandwidth and centralize data via externals |
19:03
|
tentkls_i |
but you need a centralized dump or its not a backup |
19:04
|
xmc |
i have a backup of my laptop on an sd card in my wallet. it's offline, thus unavailable, does that mean it's not a backup? |
19:04
|
xmc |
no, you don't need centralized |
19:04
|
xmc |
you just need a reasonable guarantee of being able to recall it as needed |
19:04
|
tentkls_i |
ia.bak gl in getting those all centralized |
19:05
|
tentkls_i |
im not demeaning the efforts but there is a glaring error |
19:05
|
xmc |
ia.bak requires hosts to check in monthly |
19:05
|
xmc |
otherwise they get marked as "untrusted" and the data is given to someone else to also have a copy of |
19:06
|
xmc |
and the goal is to have two non-IA copies of everything at all times |
19:07
|
xmc |
or is there something i'm missing? |
19:07
|
tentkls_i |
ia can have everything because they have PBs; but us we need to scrape intelligently |
19:08
|
tentkls_i |
they are indisputably bloated. you can cut it down drastically via intelligent scraping |
19:08
|
xmc |
i don't understand what you're saying |
19:10
|
tentkls_i |
bandwidth is the main limitation. storage is the second. multiple nodes deals with bandwidth via smart scraping / externals to send data |
19:11
|
xmc |
i don't know what you mean by "smart scraping" and "externals", please be more detailed |
19:12
|
tentkls_i |
my server is ~200T raw; i mail externals with data; the receiver takes the data then fills it with their data; in a centralized fashion |
19:12
|
xmc |
what is "externals" |
19:12
|
tentkls_i |
external hdds |
19:12
|
xmc |
oh ok |
19:13
|
xmc |
ia.bak as i understand it is meant to work in the background, not requiring any attention from the participants once it's set up |
19:13
|
tentkls_i |
ia can afford to have 20 different encodings of a file. but anyone else can't. so you take the encoding you trust as the stable |
19:13
|
xmc |
sending physical mail is ... not that |
19:14
|
xmc |
right |
19:14
|
xmc |
the "source" is USUALLY enough |
19:14
|
xmc |
there are (rather rare) instances where you need to keep the derived data |
19:15
|
tentkls_i |
source? |
19:15
|
xmc |
"source" being whatever was uploaded originally, "derived" are the different encodings that IA created |
19:17
|
tentkls_i |
xmc let me give you an example |
19:17
|
xmc |
ok |
19:17
|
tentkls_i |
the pdf is 20M; the compressed 'all' version is 700M https://archive.org/download/theoryofliteratu00inwell |
19:17
|
xmc |
the pdf isn't the source though |
19:18
|
xmc |
the source data, from the scan, that you REALLY want to keep, is the _orig_jp2.tar file, which is 413 megs |
19:19
|
xmc |
you can make the pdf again from the scans, but you can't make the scans from the pdf |
19:19
|
tentkls_i |
yes but in bandwidth limited env. you take what you need |
19:20
|
xmc |
that's why you do it ahead of time, so it's not an emergency and you can get the good stuff |
19:20
|
tentkls_i |
pdf is the most likely to be verified. i choose to dl 20 pdfs over one src |
19:20
|
xmc |
verified? |
19:20
|
tentkls_i |
meaning someone has likely viewed |
19:21
|
xmc |
i don't understand, use more words please? |
19:21
|
tentkls_i |
encodings fk up a lot; so you have to go with the one that is likely to be the most widely used |
19:21
|
tentkls_i |
on the hope, assumption, that any kinks have been ironed out |
19:22
|
xmc |
pdf is made from source images, which are created by hand and verified by the person operating the scanner |
19:23
|
tentkls_i |
do you get what i mean by 'intelligent scraping' ? tho you might disagree that is the right adjective |
19:24
|
xmc |
i would say pragmatic & opinionated, but i understand |
19:24
|
tentkls_i |
well its like a curator of a museum |
19:24
|
xmc |
yes |
19:24
|
tentkls_i |
you have to have your own style or you are just a packrat :D |
19:24
|
xmc |
i suppose |
19:25
|
tentkls_i |
getting source is the best you are right, and you made me understand that; but i believe it is not worth 20x the bandwidth |
19:26
|
tentkls_i |
i take a collection, investigate what makes it common and make a decision; then i scrape. that is my method |
19:27
|
tentkls_i |
i can host all your data; i can front the externals; and you can request all data; such transaction to scrapers can not take place out of bandwidth |
19:28
|
tentkls_i |
i can mail an 8T full; that would take months for the normal person in bandwidth; tho only 3 days shipping and 1 day xfer |
19:35
|
tentkls_i |
any veteran scraper knows you don't automate scraping; if your 'backup data' has a worse connection than the source then... |
19:36
|
xmc |
so i'm not sure you understand ia.bak at all |
19:36
|
tentkls_i |
ive read into it several times; maybe i dont |
19:36
|
xmc |
but it seems like you are set in your ways and not interested anyway |
19:36
|
xmc |
anyway, don't stop! |
19:36
|
tentkls_i |
i dont expect you to agree; just thought i give u a diff perspective |
19:37
|
xmc |
kool thx |
19:42
|
|
Asparagir has joined #internetarchive |
19:49
|
|
tentkls_i has left Leaving |
20:07
|
|
kyan has joined #internetarchive |
22:40
|
|
Martini has quit IRC (Ping timeout: 255 seconds) |
23:59
|
|
X-Scale has quit IRC (Ping timeout: 240 seconds) |
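
The shipping-versus-bandwidth claim made at 19:28 can be checked with rough arithmetic. The sketch below is only an illustration: the 8T drive size and the "3 days shipping and 1 day xfer" figure come from the log, while the link speeds, the use of decimal terabytes, and the helper name `transfer_days` are assumptions made for the example.

```python
# Rough comparison of mailing a full drive vs. downloading the same data.
# Assumptions (not from the log): decimal terabytes, sustained link speeds,
# and the hypothetical helper name transfer_days.

SECONDS_PER_DAY = 86_400


def transfer_days(size_tb: float, link_mbit_per_s: float) -> float:
    """Days needed to move size_tb terabytes over a sustained link."""
    bits = size_tb * 1e12 * 8                  # terabytes -> bits
    seconds = bits / (link_mbit_per_s * 1e6)   # Mbit/s -> bit/s
    return seconds / SECONDS_PER_DAY


DRIVE_TB = 8         # drive size mentioned in the log
MAIL_DAYS = 3 + 1    # 3 days shipping + 1 day local transfer, per the log

if __name__ == "__main__":
    for mbit in (10, 50, 100, 1000):           # hypothetical link speeds
        days = transfer_days(DRIVE_TB, mbit)
        faster = "mail" if MAIL_DAYS < days else "network"
        print(f"{mbit:>5} Mbit/s: {days:6.1f} days by network, {faster} is faster")
```

Under these assumptions a sustained 10 Mbit/s link needs roughly 74 days for 8T, which is consistent with the "months" estimate, while a sustained 1 Gbit/s link finishes in under a day and beats the mailed drive.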