Anyone want to download all the issues of Dagens Nyheter and Dagens Industri? I can get all the PDF links and the auth cookie, but they're 80-90 MB apiece and ArchiveBot won't handle an auth cookie
1 per day, 3 different papers total; only the last 3 years would be 100 GB
They're both fairly large newspapers in Sweden
Now that's something I'd happily do. Yet 1 per day and the last 3 years? What do you mean?
They're a newspaper, these are PDF renders of it
There's only 1/day, and Dagens Industri only goes back the last 3 years (DN has longer)
one issue of the newspaper per day, or one download per day (rate limit)?
One issue per day gets published
mundus: Did you convert these ebooks to EPUB or is that the original?
what?
http://dh.mundus.xyz/requests/Stephen%20King%20-%20The%20Dark%20Tower%20Series/
that's what I downloaded from bib
What is bib?
Bibliotik, the largest ebook tracker
Isn't libgen larger?
mundus: and how might I get access to this?
dunno, find an invite thread on other trackers
dd0a13f37, I thought bib was the biggest, not heard of libgen
2 million books (science/technology) + fiction + nearly all scientific journals
Torrents: 296700
spose so
libgen.io
ah, libgen isn't a tracker
What is in this 2015.tar.gz, mundus?
pack of books
VADemon: my.mixtape.moe/fwijcd.txt my.mixtape.moe/eazdrw.txt my.mixtape.moe/oghmyp.txt see query for auth cookie
https://mirror.mundus.xyz/drive/Archives/TrainPacks/
mundus: you don't have any bandwidth limits, do you?
no
hmm
mundus: are there duplicates in these packs?
no
It's not a tracker, but it operates in the same way
ooh boy
mundus: is there metadata with these books?
mundus: mind if I grab?
dd0a13f37, so it follows DMCA takedowns?
I don't know
No, go ahead
thank you
They don't care, their hosting is in the Seychelles and their domain in the Indian Ocean
mundus: what is here? https://mirror.mundus.xyz/drive/Keys/
keys
lol
Many of your dirs 404
I can't read it though...
if they 404, refresh
Keys requires a password
Those should not be in a web-exposed directory, even if it has password auth
meh
mundus: see https://mirror.mundus.xyz/drive/Pictures/
Yeah, that has a password
If you want large amounts of books as torrents you can download libgen's torrents. You can also download their database (only 6 GB), which contains hashes of all the books in the formats that were most common when they built it: MD5, TTH, ED2K, and something more
because I don't need any of you fucks seeing my pics
You can use that to bulk classify unsorted books (a sketch of this follows further down)
mundus: what kind of pics?
of my family
ahh, so you have family, ok
eh it's fine, good to have family
mundus: can you unlock this? https://mirror.mundus.xyz/drive/Keys/
And thanks! :D
no
aww
is there a key to it?
it's like SSH keys
second: why are you asking dumb questions
oooo
mundus: I thought it was software keys and stuff, nvm
Frogging: curious
mundus: thank you very very much for the data
sure
hope there are no timing attacks in the password protection or anything like that; as we all know, vulnerabilities in obscure features added as an afterthought are extremely rare
if someone pwned it I would give zero fucks
mundus: how big is this directory, if it's not going to take you too much work to figure out? https://mirror.mundus.xyz/drive/Manga/
538 GB
thank you, this is cool
mundus, thanks for sharing
np
You should upload it somewhere
like where
what?
it is uploaded "somewhere"
http://libgen.io/comics/index.php
The contents of the drive
...
what?
???
Libgen accepts larger uploads via FTP
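A minimal sketch of the "bulk classify unsorted books" idea mentioned above, assuming you have already exported the MD5 column of the libgen database dump to a plain text file (all file names here are hypothetical):

```bash
# Hash every file in the unsorted directory (md5sum prints lowercase hex).
find ./unsorted-books -type f -exec md5sum {} + > local-hashes.txt

# Normalise the exported libgen hashes to lowercase so they match md5sum output.
tr '[:upper:]' '[:lower:]' < libgen-md5-export.txt > libgen-md5.txt

# Split local files into "already catalogued in libgen" and "not found there";
# the latter are candidates for upload. Note that grep loads the whole pattern
# file into memory, so a multi-million-line hash list needs a few GB of RAM.
grep  -F -f libgen-md5.txt local-hashes.txt > already-in-libgen.txt
grep -vF -f libgen-md5.txt local-hashes.txt > not-in-libgen.txt
```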
You're saying upload the TrainPacks stuff?
Yes, or would they already have it all?
ohh
And also the manga, I think they would accept it under comics
I thought you were saying the whole thing
isn't manga video? or is that comics
Anime is video, manga is comics
uh
I wish there was a better way I could host this stuff
I get too much traffic
Torrents?
where's it hosted now?
Google Drive + caching FS + Scaleway
But I have no money, so this is the result
Is Google Drive the backend?
What?
yes
Isn't that expensive as fuck?
No
Or are you using lots of accounts?
I have unlimited storage through school
ayyyy
is it encrypted on Drive?
I always thought that just meant 20 GB instead of 2
Yes
And I have an eBay account too, which it's all mirrored to
unencrypted, cuz idgaf
eBay? Does eBay have cloud storage?
What about using torrents? You could use the server as a webseed (a sketch follows further down)
No, I bought an unlimited account off eBay
that's a lot of hashing
just what I have shared is 150 TB
So 5 months at 100 Mbit
Could gradually switch it over I guess, some of it might already be hashed if you got it as a whole
I dream of https://www.online.net/en/dedicated-server/dedibox-st48
At $4.5/TB-month you would recoup the costs pretty quickly by buying hard drives and hosting yourself, with some cheap VPS as a reverse proxy
I have a shit connection
And my parents would never let me spend that much
But $200/mo is okay?
>Dream of
no :p
What's taking up the majority of the space?
mundus: I must mirror your stuff quicker then...
A lot of it already is, search for the filenames on btdig.com
dd0a13f37: ind downloaded 791 files, 13G; wee 133 files and 1.9G; ovrigt ("other") 255 files and 3.9G
791 di_all.txt
255 dio_all.txt
133 diw_all.txt
mundus: do you have this book? Cooking for Geeks, 2nd edition http://libgen.io/book/index.php?md5=994D5F0D6F0D2C4F8107FCEF98080698
thanks
have we archived libgen?
No, but they provide repository torrents, are uploaded to Usenet, and are backed up in various other places, maybe the Internet Archive
mundus: I love dmca.mp4
thanks :)
what is the source? I must know :p
I need to make sure that's on all my servers
idk
Question about the !yahoo switch
Is 4 workers the best for performance?
not necessarily
more workers = more load on the site
so I'm looking at Patreon to get money to buy VHS tapes
But if they have a powerful CDN and they won't block you?
sure
After a certain point, it won't do anything - 1 million workers will just kill the pipeline
so where's the "maximum"?
dunno if there is one, but most people just use the default of 3. there's no need to use more than that in most cases
is there a https://github.com/bibanon/tubeup maintainer around here?
refeed: #bibanon on Rizon IRC
thx
mundus: IA most likely has an archive of Library Genesis. But not publicly accessible.
they have non-publicly-accessible stuff?
Yes, tons of it. The Wayback Machine's underlying data isn't publicly accessible, for example - well, everything archived by IA's crawlers etc.
If the copyright holder complains about an item, they'll also block access but not delete it, by the way.
my libgen scimag efforts are pretty dead. the torrents are not well seeded :(
Had a request about "Deez Nutz"
yup
https://www.youtube.com/watch?v=uODUnXf-7qc
Here's the Fat Boys in 1987 with a song called "My Nuts"
Is there a limit to how many URLs curl can take in one command?
And then in 1992, five years later, Dr. Dre released a song called "Deez Nuuts".
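On the webseed suggestion above: a sketch of how one directory of the mirror could be packaged as a torrent with the existing HTTP mirror as a web seed, assuming mktorrent and its -w web-seed option; the tracker URL is a placeholder. The hashing cost is real, though: every byte has to be read once, and 150 TB pulled out of Google Drive at ~100 Mbit/s (~12.5 MB/s) is about 12 million seconds, i.e. the 4-5 months mentioned.

```bash
# Hypothetical example: build a torrent for the Manga directory and point its
# web seed at the existing HTTP mirror, so the server only has to serve pieces
# that no peer is sharing yet. The announce URL is a placeholder tracker.
mktorrent \
    -a udp://tracker.example.invalid:6969/announce \
    -w https://mirror.mundus.xyz/drive/Manga/ \
    -o Manga.torrent \
    ./Manga
```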
http://www.urbandictionary.com/define.php?term=deez%20nutz
I did something like `curl 'something/search.php?page_number='{0..2000}`
And it only got about halfway through
nvm, I miscalculated the offsets
>http://www.urbandictionary.com/define.php?term=deez%20nutz
okay, now I understand ._.
dd0a13f37: The only limit I can think of is the kernel's maximum command length limit.
What's the fastest concurrency you should use if you're interacting with a powerful server and you're not ratelimited?
At some point, it won't do anything.
Right now I'm using 20, but should I crank it up further?
Depends on your machine and the network, I'd say.
For example, would 200 make things faster? What about 2000? 20000?
At some point you're limited by internet, obviously (10 Mbit approx)
I'm using aria2c, so CPU load isn't a problem
I guess you'll just have to test. Check how many responses you get per time for different concurrencies, then pick the point of diminishing returns (a rough benchmark sketch follows below).
That seems like a reasonable idea
Another question: I have a list of links; they have 2 parameters, call them issue and pagenum
each issue is a number of pages long, so if pagenum N exists then pagenum N-1 exists
What's the best way to download all these? Get a random sample of 1000 URLs, increment the page number by 1 until 0 match, then feed it to ArchiveBot and put up with a very high error rate?
I think this would be best solved with wpull and a hook script. You throw page 1 for each issue into wpull. The hook checks whether page N exists and adds page N+1 if so. That wastes one request per issue. It won't work with ArchiveBot though, obviously.
Any idea how large this is?
Data dependencies also
No idea; 43618 issues, X pages per issue, each page is a JPG, 1950x? px
one JPG is maybe 500 KB-1 MB
one issue is 20-50 pages, seems to vary
Hmm, I see. That's a bit too large for me currently.
But sending an HTTP request that misses, is that such a big deal?
I've seen some IP bans after too many 404s per time, but generally probably not. I'd be more worried about potentially missing pages.
So I can do like I said, get the highest page number, then generate a list of pages to get?
Does Akamai ban for 404s?
Yeah, you can try that (a sketch of generating such a page list follows below). We can always go through the logs later and see which issues had no 404s, then look into those in detail and queue any missed pages.
No idea. I don't think I've archived anything from Akamai yet (at least not knowingly). But I doubt it, to be honest. Most of those that I've seen were obscure little websites.
Well, the time-critical part of the archiving is done anyway
So you probably have time, yeah
Bloody hell, there are lots that have >100 pages while the majority are around 40; this will be much harder than I thought
Hm, yeah, then the other way with wpull and a hook script might be better. It'll be a few TB then, I guess.
Or you could autogen 200 requests/issue, which should be plenty, then feed it into ArchiveBot - at an overhead of 1 KB/request and an average of 20 pages/issue, that's 180 KB wasted per issue, which is <2% of the total
Or will it take too much CPU?
hey chfoo, could you remove the password protection on the archivebot logs? Anything that's private can just be requested via PM anyway
Does archivebot deduplicate? If archive.org already has something, will it upload a new copy still?
It does not deduplicate (except inside a job).
It does NOT.
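A rough way to do the concurrency test suggested above: time the same sample of URLs at a few concurrency levels and stop raising the level once the run time stops dropping. This sketch uses xargs and curl; with aria2c you would vary -j instead. urls-sample.txt is a hypothetical file of a few hundred representative URLs.

```bash
# Time a fixed batch of URLs at increasing concurrency levels and look for the
# point of diminishing returns.
for jobs in 5 10 20 50 100; do
    start=$(date +%s)
    xargs -P "$jobs" -n 1 curl -s -o /dev/null < urls-sample.txt
    end=$(date +%s)
    echo "concurrency=$jobs finished in $((end - start))s"
done
```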
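And a sketch of generating the page list discussed above. Rather than picking one global maximum, this variant probes each issue with HEAD requests until the first 404 (the same one-wasted-request-per-issue cost as the wpull hook idea) and writes out a flat URL list. The URL pattern and file names are hypothetical placeholders for the real site; feeding the result from a file also sidesteps the kernel command-length limit mentioned earlier.

```bash
#!/bin/bash
# For each issue, probe ascending page numbers with HEAD requests until the
# first 404 (curl --fail exits non-zero on HTTP errors), emitting every URL
# that exists. Costs exactly one wasted request per issue.
BASE='https://example.invalid/render.php'    # hypothetical URL pattern

while read -r issue; do
    page=1
    while curl -sf --head -o /dev/null "${BASE}?issue=${issue}&page=${page}"; do
        echo "${BASE}?issue=${issue}&page=${page}"
        page=$((page + 1))
    done
done < issues.txt > all-pages.txt

# Then download from the file instead of the command line, e.g.:
#   aria2c -j 20 -i all-pages.txt
```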
If something gets darked on IA, can you still download it if you ask them nicely out of band, or is it kept secret until some arbitrary date?
Not that I know of
dd0a13f37, I can't. The logs were made private for a good reason I can't remember.
chfoo: Can you tell me the password? I've been asking several times, and nobody knew what it was... :-|
sent a PM
SketchCow: You don't know whether you can download it, or whether they keep it secret?
dd0a13f37: I'm not sure he can tell you, but if you're a researcher, you can get access to other things in person at IA; you'd need to email any requests of that nature and get them approved first, though
Okay, I see
So if I upload something that will get darked, it's not "wasted" in the sense that it will take 70+ years for it to become available?
In theory, that is
dd0a13f37: you should realistically plan on never seeing something that was darked again, unless you have research credentials; that way you'll be pleasantly surprised in case it does turn up again
So in other words, you have to archive the archives
What about internetarchive.bak, won't they be getting a full collection?
IA.BAK
IA.BAK, okay
Why did they change the name?
sure, but generally darked items are spam or items that the copyright holder cares enough about to write in about; in which case, you should easily be able to get a copy elsewhere
It's the same. We were writing at the same time.
ah okay
I'm wondering about the newspapers I'm archiving: since they're uploaded directly to IA and nobody downloads and extracts the PDFs, a copyright complaint could make them more or less permanently unavailable
dd0a13f37: if there's an archive that's being sold or actively managed, I'd be wary of uploading; otherwise chances seem pretty good
It's the subscriber-only section of a few Swedish newspapers; some of it is old stuff (e.g. the last 100 years except the last 20), some of it is new stuff (e.g. the last 3 years)
I've already uploaded it, and I'm behind Tor so I don't care, but it would be a shame to have it all disappear into the void
you've uploaded it, hopefully with excellent metadata, so you've done the best you can
No, I could rent a cheap server and get the torrents if it's at risk of darking, but then I would have to fuck around with bitcoins
The metadata is not included since it's all from ArchiveBot
Anything that is darked *may* or *may not* be kept. No guarantees whatsoever, except that it isn't available for any random person to download.
If you want to make sure something remains available, you need to keep a copy, and re-upload it to a new distribution channel if the existing ones decide to cease distributing it.
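On "you need to keep a copy": a sketch of pulling your own uploads back down with the internetarchive command-line client before anything gets darked. The search query is a made-up example; adjust it to however the items are actually identified (uploader, collection, subject, etc.).

```bash
# Install and configure the official archive.org command-line client.
pip install internetarchive
ia configure                     # stores archive.org credentials

# List matching item identifiers, then download each item in full.
ia search 'subject:archivebot' --itemlist > my-items.txt   # hypothetical query
while read -r item; do
    ia download "$item"
done < my-items.txt
```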