#archiveteam-bs 2017-09-24,Sun


***Odd0002 has joined #archiveteam-bs [00:04]
Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
Odd0002 has joined #archiveteam-bs
[00:10]
Odd0002 has quit IRC (ZNC - http://znc.in)
Odd0002 has joined #archiveteam-bs
[00:18]
dd0a13f37Anyone want to download all the issues of Dagens Nyheter and Dagens Industri? I can get all the PDF links and the auth cookie, but they're 80-90 MB apiece and archivebot won't handle the auth cookie
One issue per day, 3 different papers total; only the last 3 years would be ~100 GB
[00:28]
They're both fairly large newspapers in Sweden [00:34]
.... (idle for 16mn)
VADemonNow that's something I'd happily do. Yet 1 per day and the last 3 years? [00:50]
dd0a13f37What do you mean? They're a newspaper, these are PDF renders of it
There's only one per day; Dagens Industri only goes back the last 3 years (DN goes back further)
[00:51]
VADemonone issue of the newspaper per day or one download per day (rate limit) [00:53]
dd0a13f37One issue per day
Gets published
[01:00]
***BlueMaxim has quit IRC (Quit: Leaving)
Aranje has joined #archiveteam-bs
[01:02]
secondmundus: Did you convert these ebooks to epub or is that the original? [01:04]
munduswhat? [01:04]
secondhttp://dh.mundus.xyz/requests/Stephen%20King%20-%20The%20Dark%20Tower%20Series/ [01:05]
mundusthat's what I downloaded from bib [01:05]
secondWhat is bib? [01:05]
mundusBibliotik, the largest ebook tracker [01:05]
***refeed has joined #archiveteam-bs [01:06]
dd0a13f37Isn't libgen larger? [01:06]
secondmundus: and how might I get access to this? [01:07]
mundusdunno
find an invite thread on other trackers
dd0a13f37, I thought bib was the biggest, not heard of libgen
[01:07]
dd0a13f372 million books (science/technology) + fiction + nearly all scientific journals [01:08]
***swebb has quit IRC (Read error: Operation timed out)
atlogbot has quit IRC (Read error: Operation timed out)
[01:08]
mundusTorrents: 296700
spose so
[01:10]
dd0a13f37libgen.io [01:12]
mundusah, libgen isn't a tracker [01:12]
secondWhat is in this 2015.tar.gz ? mundus [01:13]
munduspack of books [01:13]
dd0a13f37VADemon: my.mixtape.moe/fwijcd.txt my.mixtape.moe/eazdrw.txt my.mixtape.moe/oghmyp.txt see query for auth cookie [01:13]
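(For anyone picking this up: a minimal sketch of the manual grab, assuming one of the pasted lists has been saved locally as urls.txt and the subscriber cookie obtained separately; both the filename and the cookie value below are placeholders.)

    import os
    import shutil
    import urllib.request
    from urllib.parse import urlparse

    URL_LIST = "urls.txt"               # placeholder: one PDF URL per line, e.g. saved from the pastes above
    AUTH_COOKIE = "session=REPLACE_ME"  # placeholder: the real auth cookie is shared privately

    with open(URL_LIST) as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        name = os.path.basename(urlparse(url).path) or "index.pdf"
        if os.path.exists(name):
            continue  # crude resume support: skip files already downloaded
        req = urllib.request.Request(url, headers={"Cookie": AUTH_COOKIE})
        with urllib.request.urlopen(req) as resp, open(name, "wb") as out:
            shutil.copyfileobj(resp, out)
        print("saved", name)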
mundushttps://mirror.mundus.xyz/drive/Archives/TrainPacks/ [01:14]
secondmundus: you don't have any bandwidth limits do you? [01:14]
mundusno [01:14]
secondhmm
mundus: are there duplicates in these packs?
[01:14]
mundusno [01:15]
dd0a13f37It's not a tracker, but it operates in the same way [01:16]
secondooh boy
mundus: is there metadata with these books?
mundus: mind if I grab?
[01:16]
mundusdd0a13f37, so it follows DMCA takedowns?
I don't know
[01:16]
dd0a13f37No [01:17]
mundusgo ahead [01:17]
secondthank you [01:17]
dd0a13f37They don't care, their hosting is in the Seychelles and their domain is registered in the Indian Ocean (.io) [01:17]
***r3c0d3x has quit IRC (Ping timeout: 260 seconds) [01:17]
secondmundus: what is here? https://mirror.mundus.xyz/drive/Keys/ [01:17]
munduskeys lol [01:18]
secondMany of your dirs 404
I can't read it though...
[01:18]
mundusif they 404 refresh [01:18]
secondkeys requires a password [01:19]
dd0a13f37Those should not be in a web-exposed directory, even if it has password auth [01:19]
mundusmeh [01:19]
***r3c0d3x has joined #archiveteam-bs [01:20]
secondmundus: see https://mirror.mundus.xyz/drive/Pictures/ [01:20]
mundusYeah, that has a password [01:20]
dd0a13f37If you want large amounts of books as torrents you can download libgen's torrents; you can also download their database (only 6 GB), which contains hashes of all the books in the most common formats (as of when they coded it)
MD5, TTH, ED2K, and a few more
[01:20]
mundusbecause I don't need any of you fucks seein my pics [01:20]
dd0a13f37You can use that to bulk classify unsorted books [01:21]
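(A rough sketch of that bulk-classification idea, assuming you have extracted the MD5 column of the libgen database dump into a plain text file, one hex hash per line; the filename and layout here are hypothetical.)

    import hashlib
    import sys
    from pathlib import Path

    KNOWN_HASHES_FILE = "libgen_md5.txt"  # hypothetical: one MD5 per line, pulled from the DB dump

    def md5_of(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    with open(KNOWN_HASHES_FILE) as f:
        known = {line.strip().lower() for line in f}

    # Usage: python classify.py /path/to/unsorted/books
    for book in Path(sys.argv[1]).rglob("*"):
        if book.is_file():
            status = "known" if md5_of(book) in known else "unknown"
            print(status, book)

Anything flagged "unknown" is a candidate for upload; anything "known" libgen already has.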
***Aranje has quit IRC (Ping timeout: 506 seconds) [01:22]
secondmundus: what kind of pics? [01:24]
mundusof my family? [01:25]
secondahh, so you have family, ok [01:27]
munduseh [01:27]
secondits fine, good to have family
mundus: can you unlock this https://mirror.mundus.xyz/drive/Keys/
And thanks! :D
[01:28]
mundusno [01:28]
secondaww
is there a key to it?
[01:28]
mundusit's like ssh keys [01:28]
Froggingsecond: why are you asking dumb questions [01:28]
secondoooo
mundus: I thought it was software keys and stuff, nvm
Frogging: curious
mundus: thank you very very much for the data
[01:28]
dd0a13f37sure hope there's no timing attacks in the password protection or anything like that, as we all know vulnerabilities in obscure features added as an afterthought are extremely rare [01:30]
***BlueMaxim has joined #archiveteam-bs [01:30]
mundusif someone pwnd it I would give zero fucks [01:30]
***r3c0d3x has quit IRC (Read error: Connection timed out)
r3c0d3x has joined #archiveteam-bs
[01:31]
secondmundus: how big is this directory, if it's not going to take you too much work to figure out https://mirror.mundus.xyz/drive/Manga/ [01:32]
mundus538GB [01:33]
secondthank you [01:33]
Froggingthis is cool mundus, thanks for sharing [01:33]
mundusnp [01:33]
dd0a13f37You should upload it somewhere [01:34]
Frogginglike where [01:34]
munduswhat? [01:34]
secondit is uploaded "somewhere" [01:34]
dd0a13f37http://libgen.io/comics/index.php
The contents of the drive
[01:34]
mundus... [01:34]
Froggingwhat? [01:35]
dd0a13f37???
Libgen accepts larger uploads via FTP
[01:35]
mundusYou're saying upload the trainpacks stuff? [01:35]
dd0a13f37Yes, or would they already have it all? [01:36]
mundusohh [01:36]
dd0a13f37And also the manga, I think they would accept it under comics [01:36]
mundusI thought you were saying the hwole thing
isn't manga video
or is that comics
[01:36]
dd0a13f37Anime is video, manga is comics [01:37]
mundusuh I wish there was a better way I could host this stuff
I get too much traffic
[01:38]
dd0a13f37Torrents? [01:38]
Froggingwhere's it hosted now? [01:38]
mundusGoogle Drive + Caching FS + Scaleway
But I have no money
so
this is the result
[01:38]
dd0a13f37Is google drive the backend? What? [01:39]
mundusyes [01:39]
dd0a13f37Isn't that expensive as fuck? [01:39]
mundusNo [01:39]
dd0a13f37Or are you using lots of accounts? [01:39]
mundusI have unlimited storage through school [01:39]
Froggingayyyy
is it encrypted on Drive?
[01:39]
dd0a13f37I always thought that just meant 20 GB instead of 2 [01:40]
mundusYes
And I have an ebay account too
which it's all mirrored to
unencrypted
cuz idgak
[01:40]
Froggingebay? [01:41]
dd0a13f37Does ebay have cloud storage? [01:41]
mundus*idgaf [01:41]
dd0a13f37What about using torrents? You could use the server as webseed [01:41]
mundusNo, I bought an unlimited account off ebay
that's a lot of hashing
just what I have shared is 150TB
[01:41]
dd0a13f37So 5 months at 100mbit
Could gradually switch it over I guess, some of it might already be hashed if you got it as a whole
[01:42]
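(The back-of-the-envelope math behind the "5 months at 100mbit" figure, using the 150 TB mentioned above:)

    size_bits = 150e12 * 8       # 150 TB (decimal) in bits
    rate_bits_per_s = 100e6      # 100 Mbit/s sustained
    seconds = size_bits / rate_bits_per_s
    print(seconds / 86400 / 30)  # ~4.6 months, i.e. roughly the "5 months" quoted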
mundusI dream of https://www.online.net/en/dedicated-server/dedibox-st48 [01:44]
dd0a13f37At $4.5/TB-month you would recoup the costs pretty quickly by buying hard drives and hosting yourself
and using some cheap VPS as a reverse proxy
[01:47]
mundusI have a shit connection
And my parents would never let me spend that much
[01:47]
dd0a13f37But $200/mo is okay? [01:48]
mundus>Dream of
no :p
[01:48]
dd0a13f37What's taking up the majority of the space? [01:49]
***drumstick has quit IRC (Remote host closed the connection) [01:52]
ruunyan has joined #archiveteam-bs [02:04]
secondmundus: I must mirror your stuff quicker then... [02:05]
dd0a13f37A lot of it already is, search for the filenames on btdig.com [02:13]
***etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) [02:26]
swebb has joined #archiveteam-bs
svchfoo1 sets mode: +o swebb
[02:33]
VADemondd0a13f37: ind downloaded 791 files, 13G; wee 133 files and 1.9G; ovrigt 255 files and 3.9G [02:35]
dd0a13f37791 di_all.txt 255 dio_all.txt 133 diw_all.txt [02:43]
secondmundus: do you have this book? cooking for geeks 2nd edition [02:44]
dd0a13f37http://libgen.io/book/index.php?md5=994D5F0D6F0D2C4F8107FCEF98080698 [02:47]
secondthanks [02:53]
mundushave we archived libgen? [03:07]
dd0a13f37No, but they provide repository torrents, they're uploaded to Usenet, and they're backed up in various other places, maybe the Internet Archive [03:09]
Froggingmundus: I love dmca.mp4 [03:13]
mundusthanks :) [03:13]
Froggingwhat is the source? I must know :p [03:13]
mundusI need to make sure that's on all my servers
idk
[03:13]
dd0a13f37Question about the !yahoo switch
Is 4 workers the best for performance?
[03:15]
Froggingnot necessarily
more workers = more load on the site
[03:16]
godaneso i'm looking at patreon to get money to buy vhs tapes [03:16]
dd0a13f37But if they have a powerful CDN [03:17]
Froggingand they won't block you? sure [03:17]
dd0a13f37After a certain point, it won't do anything - 1 million workers will just kill the pipeline
so where's the "maximum"?
[03:17]
Froggingdunno if there is one but most people just use the default of 3. there's no need to use more than that in most cases [03:18]
***dd0a13f37 has quit IRC (Ping timeout: 268 seconds) [03:32]
.... (idle for 18mn)
zenguy has quit IRC (Read error: Operation timed out) [03:50]
drumstick has joined #archiveteam-bs [03:55]
......... (idle for 42mn)
_refeed_ has joined #archiveteam-bs
refeed has quit IRC (Read error: Connection reset by peer)
Sk1d has quit IRC (Ping timeout: 250 seconds)
pizzaiolo has quit IRC (Ping timeout: 260 seconds)
[04:37]
Sk1d has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
[04:46]
................... (idle for 1h30mn)
__refeed_ has joined #archiveteam-bs [06:19]
_refeed_ has quit IRC (Ping timeout: 600 seconds) [06:27]
..... (idle for 23mn)
schbirid has joined #archiveteam-bs [06:50]
.... (idle for 19mn)
__refeed_ is now known as refeed [07:09]
refeedis there a https://github.com/bibanon/tubeup maintainer around here? [07:10]
....... (idle for 33mn)
***BartoCH has quit IRC (Ping timeout: 260 seconds) [07:43]
VADemonrefeed: #bibanon on Rizon IRC [07:49]
***VADemon has quit IRC (Quit: left4dead) [07:49]
refeedthx [07:50]
***BartoCH has joined #archiveteam-bs [07:52]
................ (idle for 1h15mn)
JAAmundus: IA most likely has an archive of Library Genesis.
But not publicly accessible.
[09:07]
mundusthey ave not publicly accessible stuff?
*have
[09:08]
JAAYes, tons of it.
All of Wayback Machine isn't publicly accessible, for example.
Well, everything archived by IA's crawlers etc.
If the copyright holder complains about an item, they'll also block access but not delete it, by the way.
[09:11]
...... (idle for 25mn)
schbiridmy libgen scimag efforts are pretty dead. the torrents are not well seeded :( [09:38]
***dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[09:51]
.......... (idle for 47mn)
Dimtree has quit IRC (Read error: Operation timed out) [10:42]
..... (idle for 23mn)
Dimtree has joined #archiveteam-bs
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
drumstick has quit IRC (Read error: Operation timed out)
[11:05]
brayden has quit IRC (Ping timeout: 255 seconds)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[11:18]
brayden has quit IRC (Ping timeout: 255 seconds)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[11:26]
........ (idle for 35mn)
refeed has quit IRC (Ping timeout: 600 seconds) [12:03]
.... (idle for 18mn)
brayden has quit IRC (Ping timeout: 255 seconds)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
Soni has quit IRC (Read error: Operation timed out)
[12:21]
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
Soni has joined #archiveteam-bs
[12:35]
..... (idle for 22mn)
dd0a13f37 has joined #archiveteam-bs [13:01]
...... (idle for 29mn)
refeed has joined #archiveteam-bs [13:30]
..... (idle for 23mn)
BlueMaxim has quit IRC (Quit: Leaving) [13:53]
...... (idle for 29mn)
SketchCowHad a request about "Deez Nutz" [14:22]
refeedyup [14:22]
SketchCowhttps://www.youtube.com/watch?v=uODUnXf-7qc
Here's the Fat Boys in 1987 with a song called "My Nuts"
[14:22]
refeedrefeed is watching the video [14:23]
dd0a13f37Is there a limit to how many URLs curl can take in one command [14:24]
SketchCowAnd then in 1992, five years later, Dr. Dre released a song called "Deez Nuuts".
http://www.urbandictionary.com/define.php?term=deez%20nutz
[14:24]
dd0a13f37I did something like `curl 'something/search.php?page_number='{0..2000}`
And it only got about halfway through
[14:25]
nvm, i miscalculated the offsets [14:30]
refeed>http://www.urbandictionary.com/define.php?term=deez%20nutz , okay now I understand ._. [14:36]
.... (idle for 18mn)
***mls has quit IRC (Ping timeout: 250 seconds)
mls has joined #archiveteam-bs
[14:54]
....... (idle for 31mn)
refeed has quit IRC (Leaving)
mls has quit IRC (Ping timeout: 250 seconds)
[15:26]
..... (idle for 23mn)
mls has joined #archiveteam-bs [15:49]
brayden has quit IRC (Read error: Connection reset by peer)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
brayden has quit IRC (Read error: Connection reset by peer)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[16:00]
.... (idle for 17mn)
Soni has quit IRC (Ping timeout: 255 seconds) [16:19]
Soni has joined #archiveteam-bs [16:24]
odemg has quit IRC (Quit: Leaving)
dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[16:31]
JAAdd0a13f37: The only limit I can think of is the kernel's maximum command length limit. [16:36]
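(One way around that limit, as a sketch: don't put the URLs on the command line at all, but write them to a file and hand the file to the downloader; aria2c -i and wget -i both read URL lists from files, and curl can expand a numeric range itself with a pattern like 'search.php?page_number=[0-2000]', which stays a single argument. The base URL below is a placeholder.)

    # Generate the URL list in a file instead of via shell brace expansion,
    # so the kernel's argv size limit never comes into play.
    BASE = "https://example.org/search.php?page_number="  # placeholder base URL

    with open("urls.txt", "w") as f:
        for n in range(0, 2001):
            f.write(f"{BASE}{n}\n")

    # Then: aria2c -i urls.txt   (or: wget -i urls.txt)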
***dd0a13f37 has joined #archiveteam-bs [16:45]
.... (idle for 19mn)
dd0a13f37What concurrency gives the best speed if you're interacting with a powerful server and you're not rate-limited? At some point, it won't do anything. Right now I'm using 20, but should I crank it up further? [17:04]
JAADepends on your machine and the network, I'd say. [17:04]
dd0a13f37For example, would 200 make things faster? What about 2000? 20000? At some point you're limited by your internet connection, obviously (approx. 10 Mbit)
I'm using aria2c, so CPU load isn't a problem
[17:04]
JAAI guess you'll just have to test. Check how many responses you get per time for different concurrencies, then pick the point of diminishing returns. [17:06]
dd0a13f37That seems like a reasonable idea [17:08]
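(A quick way to run that test, as a sketch using only the standard library: fetch the same sample of URLs at a few concurrency levels and compare requests per second. The sample file name is a placeholder; a few hundred URLs is usually enough.)

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
            return True
        except Exception:
            return False

    def benchmark(urls, concurrency):
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            ok = sum(pool.map(fetch, urls))
        elapsed = time.time() - start
        print(f"concurrency={concurrency}: {ok}/{len(urls)} ok, {len(urls) / elapsed:.1f} req/s")

    with open("sample_urls.txt") as f:  # placeholder: a representative sample of the real URLs
        sample = [line.strip() for line in f if line.strip()]

    for c in (5, 20, 50, 100):
        benchmark(sample, c)

Whichever level stops improving req/s noticeably is the point of diminishing returns mentioned above.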
Another question: I have a list of links, they have 2 parameters, call them issue and pagenum
each issue is a number of pages long, so if pagenum N exists then pagenum N-1 exists
What's the best way to download all these? Get a random sample of 1000 URLs, increment the page number by 1 until 0 matches, then feed to archivebot and put up with a very high error%
(a very high rate of errors)
[17:14]
JAAI think this would be best solved with wpull and a hook script. You throw page 1 for each issue into wpull. The hook checks whether page N exists and adds page N+1 if so.
That wastes one request per issue.
It won't work with ArchiveBot though, obviously.
Any idea how large this is?
[17:18]
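(The wpull hook API has its own specifics, so here is only the underlying logic as a standalone sketch: probe each issue page by page and stop at the first 404. Inside wpull the regular fetch doubles as the probe, which is what keeps it to one wasted request per issue; this version uses cheap HEAD requests instead. The URL template and parameter names are hypothetical.)

    import urllib.error
    import urllib.request

    # Hypothetical URL template; the real site's parameter names will differ.
    TEMPLATE = "https://example.org/issue.php?issue={issue}&page={page}"

    def pages_of(issue):
        """Yield page URLs for one issue until the first missing page (404)."""
        page = 1
        while True:
            url = TEMPLATE.format(issue=issue, page=page)
            try:
                # Some servers reject HEAD; fall back to a normal GET if so.
                urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
            except urllib.error.HTTPError as e:
                if e.code == 404:
                    return
                raise
            yield url
            page += 1

    for issue in range(1, 43619):  # 43618 issues, per the count below
        for url in pages_of(issue):
            print(url)  # feed this list to the downloader of choice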
dd0a13f37Data dependencies also
No idea, 43618 issues, X pages per issue, each page is a jpg file 1950x? px
one jpg is maybe 500k-1mb
one issue is 20-50 pages, seems to vary
[17:21]
JAAHmm, I see. That's a bit too large for me currently. [17:24]
dd0a13f37But sending a missed HTTP request, is that such a big deal? [17:27]
JAAI've seen some IP bans after too many 404s per time, but generally probably not. I'd be more worried about potentially missing pages. [17:29]
dd0a13f37So I can do like I said, get the highest page number, then generate a list of pages to get?
Does akamai ban for 404?
[17:30]
JAAYeah, you can try that. We can always go through the logs later and see which issues had no 404s, then look into those in detail and queue any missed pages.
No idea. I don't think I've archived anything from Akamai yet (at least not knowingly).
But I doubt it, to be honest. Most of those that I've seen were obscure little websites.
[17:32]
dd0a13f37Well, the time-critical part of the archiving is done anyway
So you probably have time, yeah
[17:33]
Bloody hell, there's lots that have >100 pages while the majority are around 40, this will be much harder than I thought [17:44]
JAAHm, yeah, then the other way with wpull and a hook script might be better.
It'll be a few TB then, I guess.
[17:45]
dd0a13f37Or you could autogen 200 requests/issue which should be plenty, then feed it into archivebot - at an overhead of 1 KB/request and an average of 20 pages/issue, this gives you 180 KB wasted/issue, which is <2% of the total
Or will it take too much CPU?
[17:49]
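(A sketch of that generator, again with a hypothetical URL template: it writes a flat list of issue/page URLs capped at 200 pages per issue, which could then be fed to archivebot as a URL list; pages past the real end of an issue simply 404.)

    # Hypothetical URL template and issue range; adjust to the real site.
    TEMPLATE = "https://example.org/issue.php?issue={issue}&page={page}"
    ISSUES = range(1, 43619)  # 43618 issues, per the figure above
    MAX_PAGES = 200           # generous cap; most issues have ~20-50 pages

    with open("issue_pages.txt", "w") as out:
        for issue in ISSUES:
            for page in range(1, MAX_PAGES + 1):
                out.write(TEMPLATE.format(issue=issue, page=page) + "\n")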
.......... (idle for 47mn)
hey chfoo, could you remove password protection on archivebot logs? Anything that's private can just be requested via PM anyway [18:38]
***jsa has quit IRC (Remote host closed the connection)
jsa has joined #archiveteam-bs
Mateon1 has quit IRC (Ping timeout: 260 seconds)
Mateon1 has joined #archiveteam-bs
[18:39]
kristian_ has joined #archiveteam-bs [18:50]
..... (idle for 21mn)
spacegirl has quit IRC (Read error: Operation timed out)
spacegirl has joined #archiveteam-bs
icedice has joined #archiveteam-bs
[19:11]
dd0a13f37Does archivebot deduplicate? If archive.org already has something, will it upload a new copy still? [19:18]
JAAIt does not deduplicate (except inside a job). [19:18]
***kristian_ has quit IRC (Quit: Leaving)
odemg has joined #archiveteam-bs
[19:21]
SketchCowIt does NOT. [19:22]
dd0a13f37If something gets darked on IA, can you still download it if you ask them nicely out of band or is it kept secret until some arbitrary date? [19:24]
***box41 has joined #archiveteam-bs
box41 has quit IRC (Client Quit)
[19:26]
SketchCowNot that I know of [19:35]
chfoodd0a13f37, i can't. the logs were made private because of a good reason i can't remember. [19:40]
JAAchfoo: Can you tell me the password? I've been asking several times, and nobody knew what it was... :-| [19:41]
chfoosent a pm [19:43]
dd0a13f37SketchCow: You don't know you can download or that they keep it secret? [19:50]
***schbirid has quit IRC (Quit: Leaving)
kristian_ has joined #archiveteam-bs
[19:51]
dashclouddd0a13f37: I'm not sure he can tell you, but if you're a researcher, you can get access to other things in person at IA - you'd need to email any requests of that nature and get them approved first though [20:03]
dd0a13f37Okay, I see
So if I upload something that will get darked, it's not "wasted" in the sense that it will take 70+ years for it to become available? In theory, that is
[20:08]
***icedice has quit IRC (Quit: Leaving)
icedice has joined #archiveteam-bs
[20:19]
dashclouddd0a13f37: you should realistically plan on never seeing something that was darked again, unless you have research credentials- that way you'll be pleasantly surprised in case it does turn up again [20:21]
dd0a13f37So in other words, you have to archive the archives
What about internetarchive.bak, won't they be getting a full collection?
[20:21]
JAAIA.BAK [20:21]
dd0a13f37IA.BAK, okay
Why did they change the name?
[20:22]
dashcloudsure- but generally darked items are spam or items that the copyright holder cares enough about to write in about- in which case, you should easily be able to get a copy elsewhere [20:22]
JAAIt's the same. We were writing at the same time. [20:22]
dd0a13f37ah okay
I'm wondering about the newspapers I'm archiving: since they're uploaded directly to IA and nobody downloads and extracts the PDFs, a copyright complaint could make them more or less permanently unavailable
[20:23]
dashclouddd0a13f37: if there's an archive that's being sold or actively managed, I'd be wary of uploading, otherwise chances seem pretty good [20:27]
dd0a13f37It's the subscriber-only section of a few Swedish newspapers, some of it is old stuff (e.g. the last 100 years except the last 20), some of it is new stuff (e.g. the last 3 years)
I've already uploaded it, I'm behind Tor so I don't care, but it would be a shame to have it all disappear into the void
[20:28]
dashcloudyou've uploaded it, hopefully with excellent metadata, so you've done the best you can [20:29]
dd0a13f37No, I could rent a cheap server and get the torrents if it's at risk of darking, but then I would have to fuck around with bitcoins
The metadata is not included since it's all from archivebot
[20:30]
***dd0a13f37 has quit IRC (Ping timeout: 268 seconds) [20:43]
...... (idle for 26mn)
mls has quit IRC (Ping timeout: 250 seconds) [21:09]
mls has joined #archiveteam-bs
kristian_ has quit IRC (Quit: Leaving)
[21:17]
........ (idle for 35mn)
Mateon1 has quit IRC (Remote host closed the connection) [21:52]
........ (idle for 36mn)
drumstick has joined #archiveteam-bs [22:28]
Somebody2Anything that is darked *may* or *may not* be kept. No guarantees whatsoever, except that it isn't available for any random person to download.
If you want to make sure something remains available, you need to keep a copy, and re-upload it to a new distribution channel
if the existing ones decide to cease distributing it.
[22:30]
