[00:04] *** Odd0002 has joined #archiveteam-bs
[00:10] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[00:11] *** Odd0002 has joined #archiveteam-bs
[00:18] *** Odd0002 has quit IRC (ZNC - http://znc.in)
[00:19] *** Odd0002 has joined #archiveteam-bs
[00:28] Anyone want to download all the issues of Dagens Nyheter and Dagens Industri? I can get all the pdf links and the auth cookie, but they're 80-90mb apiece and archivebot won't handle an auth cookie
[00:29] 1 per day, 3 different papers total, only the last 3 years would be 100gb
[00:34] They're both fairly large newspapers in Sweden
[00:50] Now that's something I'd happily do. Yet 1 per day and the 3 last years?
[00:51] What do you mean? They're a newspaper, these are pdf renders of it
[00:52] There's only 1/day, Dagens Industri only goes back the last 3 years (DN has longer)
[00:53] one issue of the newspaper per day or one download per day (rate limit)?
[01:00] One issue per day
[01:00] Gets published
[01:02] *** BlueMaxim has quit IRC (Quit: Leaving)
[01:03] *** Aranje has joined #archiveteam-bs
[01:04] mundus: Did you convert these ebooks to epub or is that the original?
[01:04] what?
[01:05] http://dh.mundus.xyz/requests/Stephen%20King%20-%20The%20Dark%20Tower%20Series/
[01:05] that's what I downloaded from bib
[01:05] What is bib?
[01:05] Bibliotik, largest ebook tracker
[01:06] *** refeed has joined #archiveteam-bs
[01:06] Isn't libgen larger?
[01:07] mundus: and how might I get access to this?
[01:07] dunno
[01:07] find an invite thread on other trackers
[01:08] dd0a13f37, I thought bib was the biggest, not heard of libgen
[01:08] 2 million books (science/technology) + fiction + nearly all scientific journals
[01:08] *** swebb has quit IRC (Read error: Operation timed out)
[01:09] *** atlogbot has quit IRC (Read error: Operation timed out)
[01:10] Torrents: 296700
[01:10] spose so
[01:12] libgen.io
[01:12] ah, libgen isn't a tracker
[01:13] What is in this 2015.tar.gz ? mundus
[01:13] pack of books
[01:13] VADemon: my.mixtape.moe/fwijcd.txt my.mixtape.moe/eazdrw.txt my.mixtape.moe/oghmyp.txt see query for auth cookie
[01:14] https://mirror.mundus.xyz/drive/Archives/TrainPacks/
[01:14] mundus: you don't have any bandwidth limits do you?
[01:14] no
[01:14] hmm
[01:14] mundus: are there duplicates in these packs?
[01:15] no
[01:16] It's not a tracker, but it operates in the same way
[01:16] ooh boy
[01:16] mundus: is there metadata with these books?
[01:16] mundus: mind if I grab?
[01:16] dd0a13f37, so it follows DMCA takedowns?
[01:17] I don't know
[01:17] No
[01:17] go ahead
[01:17] thank you
[01:17] They don't care, their hosting is in the Seychelles and their domain in the Indian Ocean
[01:17] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds)
[01:17] mundus: what is here? https://mirror.mundus.xyz/drive/Keys/
[01:18] keys lol
[01:18] Many of your dirs 404
[01:18] I can't read it though...
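For the cookie-authenticated PDF grab discussed above, a minimal sketch in Python (requests) could look like the following. The cookie name, its value, and the URL-list filename are placeholders, not the site's real values:

```python
import os
import requests

# Placeholder cookie name/value; substitute the real subscriber auth cookie.
COOKIES = {"sessionid": "PASTE-COOKIE-VALUE-HERE"}

# Placeholder file: one PDF URL per line (e.g. one of the link lists above).
with open("pdf_links.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with requests.Session() as session:
    for url in urls:
        filename = url.rsplit("/", 1)[-1] or "index.pdf"
        if os.path.exists(filename):
            continue  # crude resume support: skip files already downloaded
        with session.get(url, cookies=COOKIES, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(filename, "wb") as out:
                for chunk in resp.iter_content(1 << 20):  # 1 MiB chunks
                    out.write(chunk)
        print("saved", filename)
```

A plain downloader such as aria2c with a `--header="Cookie: ..."` option would also work for this kind of bulk fetch; the point is only that the cookie has to travel with every request, which ArchiveBot does not support.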
[01:18] if they 404 refresh
[01:19] keys requires a password
[01:19] Those should not be in a web-exposed directory, even if it has password auth
[01:19] meh
[01:20] *** r3c0d3x has joined #archiveteam-bs
[01:20] mundus: see https://mirror.mundus.xyz/drive/Pictures/
[01:20] Yeah, that has a password
[01:20] If you want large amounts of books as torrents you can download libgen's torrents, you can also download their database (only 6gb), that contains hashes of all the books in the most common formats (when they coded it)
[01:20] MD5, TTH, ED2K, something more
[01:20] because I don't need any of you fucks seeing my pics
[01:21] You can use that to bulk classify unsorted books
[01:22] *** Aranje has quit IRC (Ping timeout: 506 seconds)
[01:24] mundus: what kind of pics?
[01:25] of my family?
[01:27] ahh, so you have family, ok
[01:27] eh
[01:28] it's fine, good to have family
[01:28] mundus: can you unlock this https://mirror.mundus.xyz/drive/Keys/
[01:28] And thanks! :D
[01:28] no
[01:28] aww
[01:28] is there a key to it?
[01:28] it's like ssh keys
[01:28] second: why are you asking dumb questions
[01:28] oooo
[01:29] mundus: I thought it was software keys and stuff, nvm
[01:29] Frogging: curious
[01:30] mundus: thank you very very much for the data
[01:30] sure hope there's no timing attacks in the password protection or anything like that, as we all know vulnerabilities in obscure features added as an afterthought are extremely rare
[01:30] *** BlueMaxim has joined #archiveteam-bs
[01:30] if someone pwnd it I would give zero fucks
[01:31] *** r3c0d3x has quit IRC (Read error: Connection timed out)
[01:31] *** r3c0d3x has joined #archiveteam-bs
[01:32] mundus: how big is this directory, if it's not going to take you too much work to figure out https://mirror.mundus.xyz/drive/Manga/
[01:33] 538GB
[01:33] thank you
[01:33] this is cool mundus, thanks for sharing
[01:33] np
[01:34] You should upload it somewhere
[01:34] like where
[01:34] what?
[01:34] it is uploaded "somewhere"
[01:34] http://libgen.io/comics/index.php
[01:34] The contents of the drive
[01:34] ...
[01:35] what?
[01:35] ???
[01:35] Libgen accepts larger uploads via FTP
[01:35] You're saying upload the trainpacks stuff?
[01:36] Yes, or would they already have it all?
[01:36] ohh
[01:36] And also the manga, I think they would accept it under comics
[01:36] I thought you were saying the whole thing
[01:36] isn't manga video
[01:37] or is that comics
[01:37] Anime is video, manga is comics
[01:38] uh I wish there was a better way I could host this stuff
[01:38] I get too much traffic
[01:38] Torrents?
[01:38] where's it hosted now?
[01:38] Google Drive + Caching FS + Scaleway
[01:39] But I have no money
[01:39] so
[01:39] this is the result
[01:39] Is google drive the backend? What?
[01:39] yes
[01:39] Isn't that expensive as fuck?
[01:39] No
[01:39] Or are you using lots of accounts?
[01:39] I have unlimited storage through school
[01:39] ayyyy
[01:40] is it encrypted on Drive?
[01:40] I always thought that just meant 20gbs instead of 2
[01:40] Yes
[01:40] And I have an ebay account too
[01:40] which it's all mirrored to
[01:41] unencrypted
[01:41] cuz idgak
[01:41] ebay?
[01:41] Does ebay have cloud storage?
[01:41] *idgaf
[01:41] What about using torrents? You could use the server as webseed
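The "bulk classify unsorted books" idea above (hash local files and match them against the hashes in the Library Genesis database dump) might look roughly like this sketch. It assumes the dump has already been imported into SQLite as a table `books` with `md5`, `title` and `author` columns; the real dump is a MySQL export and its table and column names vary by version, so treat the query as a placeholder:

```python
import hashlib
import sqlite3
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in 1 MiB chunks so large ebooks don't fill RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# "libgen.sqlite" and the "books" schema are assumptions about how you
# imported the dump locally, not the upstream database layout.
db = sqlite3.connect("libgen.sqlite")
for path in Path("unsorted-books").rglob("*"):
    if not path.is_file():
        continue
    row = db.execute(
        "SELECT title, author FROM books WHERE lower(md5) = ?",
        (md5_of(path),),
    ).fetchone()
    if row:
        print(f"{path.name} -> {row[1]} - {row[0]}")
    else:
        print(f"{path.name} -> not found in libgen")
```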
[01:41] No, I bought an unlimited account off ebay
[01:41] that's a lot of hashing
[01:41] just what I have shared is 150TB
[01:42] So 5 months at 100mbit
[01:42] Could gradually switch it over I guess, some of it might already be hashed if you got it as a whole
[01:44] I dream of https://www.online.net/en/dedicated-server/dedibox-st48
[01:47] At $4.5/tb-month you would recoup the costs pretty quickly by buying hard drives and hosting yourself
[01:47] and use some cheap VPS as reverse proxy
[01:47] I have a shit connection
[01:47] And my parents would never let me spend that much
[01:48] But $200/mo is okay?
[01:48] >Dream of
[01:48] no :p
[01:49] What's taking up the majority of the space?
[01:52] *** drumstick has quit IRC (Remote host closed the connection)
[02:04] *** ruunyan has joined #archiveteam-bs
[02:05] mundus: I must mirror your stuff quicker then...
[02:13] A lot of it already is, search for the filenames on btdig.com
[02:26] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[02:33] *** swebb has joined #archiveteam-bs
[02:34] *** svchfoo1 sets mode: +o swebb
[02:35] dd0a13f37: ind downloaded 791 files, 13G; wee 133 files and 1.9G; ovrigt 255 files and 3.9G
[02:43] 791 di_all.txt 255 dio_all.txt 133 diw_all.txt
[02:44] mundus: do you have this book? cooking for geeks 2nd edition
[02:47] http://libgen.io/book/index.php?md5=994D5F0D6F0D2C4F8107FCEF98080698
[02:53] thanks
[03:07] have we archived libgen?
[03:09] No, but they provide repository torrents, are uploaded to usenet, and are backed up in various other places, maybe the internet archive
[03:13] mundus: I love dmca.mp4
[03:13] thanks :)
[03:13] what is the source? I must know :p
[03:13] I need to make sure that's on all my servers
[03:13] idk
[03:15] Question about the !yahoo switch
[03:15] Is 4 workers the best for performance?
[03:16] not necessarily
[03:16] more workers = more load on the site
[03:16] so i'm looking at patreon to get money to buy vhs tapes
[03:17] But if they have a powerful CDN
[03:17] and they won't block you? sure
[03:17] After a certain point, it won't do anything - 1 million workers will just kill the pipeline
[03:17] so where's the "maximum"?
[03:18] dunno if there is one but most people just use the default of 3. there's no need to use more than that in most cases
[03:32] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[03:50] *** zenguy has quit IRC (Read error: Operation timed out)
[03:55] *** drumstick has joined #archiveteam-bs
[04:37] *** _refeed_ has joined #archiveteam-bs
[04:38] *** refeed has quit IRC (Read error: Connection reset by peer)
[04:39] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:41] *** pizzaiolo has quit IRC (Ping timeout: 260 seconds)
[04:46] *** Sk1d has joined #archiveteam-bs
[04:49] *** pizzaiolo has joined #archiveteam-bs
[06:19] *** __refeed_ has joined #archiveteam-bs
[06:27] *** _refeed_ has quit IRC (Ping timeout: 600 seconds)
[06:50] *** schbirid has joined #archiveteam-bs
[07:09] *** __refeed_ is now known as refeed
[07:10] is there a https://github.com/bibanon/tubeup maintainer around here?
[07:43] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[07:49] refeed: #bibanon on Rizon IRC
[07:49] *** VADemon has quit IRC (Quit: left4dead)
[07:50] thx
[07:52] *** BartoCH has joined #archiveteam-bs
[09:07] mundus: IA most likely has an archive of Library Genesis.
[09:07] But not publicly accessible.
[09:08] they ave not publicly accessible stuff?
[09:08] *have
[09:11] Yes, tons of it.
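A quick sanity check of the numbers quoted above (150 TB shared, a 100 Mbit/s link, and the ~$4.5/TB-month figure), ignoring protocol overhead and downtime:

```python
# 150 TB pushed over a 100 Mbit/s link, and 48 TB priced at ~$4.5/TB-month.
shared_bytes = 150e12        # 150 TB
link_bits_per_s = 100e6      # 100 Mbit/s

seconds = shared_bytes * 8 / link_bits_per_s
months = seconds / (30 * 24 * 3600)
print(f"150 TB at 100 Mbit/s: ~{months:.1f} months of continuous transfer")

print(f"48 TB at $4.5/TB-month: ~${48 * 4.5:.0f}/month")
```

This gives roughly 4.6 months of continuous transfer (matching the "5 months at 100mbit" estimate) and about $216/month for a 48 TB box at that rate, which is the comparison behind the "$200/mo" remark.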
[09:12] All of Wayback Machine isn't publicly accessible, for example.
[09:12] Well, everything archived by IA's crawlers etc.
[09:13] If the copyright holder complains about an item, they'll also block access but not delete it, by the way.
[09:38] my libgen scimag efforts are pretty dead. the torrents are not well seeded :(
[09:51] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:55] *** dashcloud has joined #archiveteam-bs
[10:42] *** Dimtree has quit IRC (Read error: Operation timed out)
[11:05] *** Dimtree has joined #archiveteam-bs
[11:09] *** brayden has joined #archiveteam-bs
[11:09] *** swebb sets mode: +o brayden
[11:12] *** drumstick has quit IRC (Read error: Operation timed out)
[11:18] *** brayden has quit IRC (Ping timeout: 255 seconds)
[11:21] *** brayden has joined #archiveteam-bs
[11:21] *** swebb sets mode: +o brayden
[11:26] *** brayden has quit IRC (Ping timeout: 255 seconds)
[11:28] *** brayden has joined #archiveteam-bs
[11:28] *** swebb sets mode: +o brayden
[12:03] *** refeed has quit IRC (Ping timeout: 600 seconds)
[12:21] *** brayden has quit IRC (Ping timeout: 255 seconds)
[12:24] *** brayden has joined #archiveteam-bs
[12:24] *** swebb sets mode: +o brayden
[12:28] *** Soni has quit IRC (Read error: Operation timed out)
[12:35] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:35] *** dashcloud has joined #archiveteam-bs
[12:39] *** Soni has joined #archiveteam-bs
[13:01] *** dd0a13f37 has joined #archiveteam-bs
[13:30] *** refeed has joined #archiveteam-bs
[13:53] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:22] Had a request about "Deez Nutz"
[14:22] yup
[14:22] https://www.youtube.com/watch?v=uODUnXf-7qc
[14:23] Here's the Fat Boys in 1987 with a song called "My Nuts"
[14:23] * refeed is watching the video
[14:24] Is there a limit to how many URLs curl can take in one command
[14:24] And then in 1992, five years later, Dr. Dre released a song called "Deez Nuuts".
[14:24] http://www.urbandictionary.com/define.php?term=deez%20nutz
[14:25] I did something like `curl 'something/search.php?page_number='{0..2000}`
[14:25] And it only got about halfway through
[14:30] nvm, i miscalculated the offsets
[14:36] >http://www.urbandictionary.com/define.php?term=deez%20nutz , okay now I understand ._.
[14:54] *** mls has quit IRC (Ping timeout: 250 seconds)
[14:55] *** mls has joined #archiveteam-bs
[15:26] *** refeed has quit IRC (Leaving)
[15:26] *** mls has quit IRC (Ping timeout: 250 seconds)
[15:49] *** mls has joined #archiveteam-bs
[16:00] *** brayden has quit IRC (Read error: Connection reset by peer)
[16:01] *** brayden has joined #archiveteam-bs
[16:01] *** swebb sets mode: +o brayden
[16:01] *** brayden has quit IRC (Read error: Connection reset by peer)
[16:02] *** brayden has joined #archiveteam-bs
[16:02] *** swebb sets mode: +o brayden
[16:19] *** Soni has quit IRC (Ping timeout: 255 seconds)
[16:24] *** Soni has joined #archiveteam-bs
[16:31] *** odemg has quit IRC (Quit: Leaving)
[16:31] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[16:36] dd0a13f37: The only limit I can think of is the kernel's maximum command length limit.
[16:45] *** dd0a13f37 has joined #archiveteam-bs
[17:04] What's the fastest concurrency you should use if you're interacting with a powerful server and you're not ratelimited? At some point, it won't do anything. Right now I'm using 20, but should I crank it up further?
[17:04] Depends on your machine and the network, I'd say.
[17:04] For example, would 200 make things faster? What about 2000? 20000? At some point you're limited by internet, obviously (10mbit approx)
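One empirical way to answer the concurrency question just above is to time a fixed batch of requests at several concurrency levels and look for the point of diminishing returns. A rough sketch using aiohttp; the probe URL and batch size are arbitrary placeholders, and this is only illustrative, not the tool actually used in the channel:

```python
import asyncio
import time

import aiohttp

# Placeholder: a cache-friendly URL on a server you are allowed to load-test.
TEST_URL = "https://example.org/some/page"
REQUESTS_PER_TRIAL = 200


async def fetch(session, sem, url):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        async with session.get(url) as resp:
            await resp.read()


async def run_trial(concurrency):
    sem = asyncio.Semaphore(concurrency)
    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, TEST_URL) for _ in range(REQUESTS_PER_TRIAL)]
        await asyncio.gather(*tasks)
    return REQUESTS_PER_TRIAL / (time.monotonic() - start)


async def main():
    for concurrency in (5, 20, 50, 100, 200):
        rps = await run_trial(concurrency)
        print(f"concurrency {concurrency:>4}: {rps:6.1f} responses/s")


asyncio.run(main())
```

Once throughput stops improving between two levels, the lower one is the sensible ceiling; pushing further only adds load on the remote site.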
[17:05] I'm using aria2c, so CPU load isn't a problem
[17:06] I guess you'll just have to test. Check how many responses you get per time for different concurrencies, then pick the point of diminishing returns.
[17:08] That seems like a reasonable idea
[17:14] Another question: I have a list of links, they have 2 parameters, call them issue and pagenum
[17:14] each issue is a number of pages long, so if pagenum N exists then pagenum N-1 exists
[17:15] What's the best way to download all these? Get random sample of 1000 urls, increment page number by 1 until 0 matches, then feed to archivebot and put up with a very high error %
[17:16] (a very high rate of errors)
[17:18] I think this would be best solved with wpull and a hook script. You throw page 1 for each issue into wpull. The hook checks whether page N exists and adds page N+1 if so.
[17:18] That wastes one request per issue.
[17:18] It won't work with ArchiveBot though, obviously.
[17:21] Any idea how large this is?
[17:21] Data dependencies also
[17:23] No idea, 43618 issues, X pages per issue, each page is a jpg file 1950x? px
[17:23] one jpg is maybe 500k-1mb
[17:23] one issue is 20-50 pages, seems to vary
[17:24] Hmm, I see. That's a bit too large for me currently.
[17:27] But sending a missed HTTP request, is that such a big deal?
[17:29] I've seen some IP bans after too many 404s per time, but generally probably not. I'd be more worried about potentially missing pages.
[17:30] So I can do like I said, get the highest page number, then generate a list of pages to get?
[17:31] Does akamai ban for 404?
[17:32] Yeah, you can try that. We can always go through the logs later and see which issues had no 404s, then look into those in detail and queue any missed pages.
[17:32] No idea. I don't think I've archived anything from Akamai yet (at least not knowingly).
[17:33] But I doubt it, to be honest. Most of those that I've seen were obscure little websites.
[17:33] Well, the time-critical part of the archiving is done anyway
[17:33] So you probably have time, yeah
[17:44] Bloody hell, there's lots that have >100 pages while the majority are around 40, this will be much harder than I thought
[17:45] Hm, yeah, then the other way with wpull and a hook script might be better.
[17:46] It'll be a few TB then, I guess.
[17:49] Or you could autogen 200 requests/issue which should be plenty, then feed it into archivebot - at an overhead of 1kb/request and an average of 20 pages/issue, this gives you 180kb wasted/issue, which is <2% of total
[17:51] Or will it take too much CPU?
[18:38] hey chfoo, could you remove password protection on archivebot logs? Anything that's private can just be requested via PM anyway
[18:39] *** jsa has quit IRC (Remote host closed the connection)
[18:40] *** jsa has joined #archiveteam-bs
[18:42] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[18:42] *** Mateon1 has joined #archiveteam-bs
[18:50] *** kristian_ has joined #archiveteam-bs
[19:11] *** spacegirl has quit IRC (Read error: Operation timed out)
[19:14] *** spacegirl has joined #archiveteam-bs
[19:15] *** icedice has joined #archiveteam-bs
[19:18] Does archivebot deduplicate? If archive.org already has something, will it upload a new copy still?
[19:18] It does not deduplicate (except inside a job).
[19:21] *** kristian_ has quit IRC (Quit: Leaving)
[19:22] *** odemg has joined #archiveteam-bs
[19:22] It does NOT.
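The issue/pagenum enumeration discussed above (pages are contiguous, so you can probe upward until the first 404) can also be written as a standalone script rather than a wpull hook; the sketch below is not wpull's actual plugin API, just the same logic using requests. The URL template and issue range are placeholders, loosely based on the figures mentioned in the channel:

```python
import requests

# Placeholder URL scheme; the real site's layout was not given in the channel.
PAGE_URL = "https://example.org/issues/{issue}/page-{page}.jpg"


def pages_for_issue(session, issue, max_pages=300):
    """Yield URLs of all pages that exist for one issue, stopping at the first 404."""
    for page in range(1, max_pages + 1):
        url = PAGE_URL.format(issue=issue, page=page)
        resp = session.head(url, allow_redirects=True, timeout=30)
        if resp.status_code == 404:
            break  # pages are contiguous, so the first 404 ends the issue
        resp.raise_for_status()
        yield url


if __name__ == "__main__":
    with requests.Session() as session:
        for issue in range(1, 43619):  # 43618 issues were mentioned above
            for url in pages_for_issue(session, issue):
                print(url)  # feed this list to the actual downloader
```

Like the wpull-hook approach, this wastes exactly one request (the 404) per issue, and a HEAD probe keeps that overhead to headers only, assuming the server answers HEAD correctly.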
[19:24] If something gets darked on IA, can you still download it if you ask them nicely out of band or is it kept secret until some arbitrary date?
[19:26] *** box41 has joined #archiveteam-bs
[19:30] *** box41 has quit IRC (Client Quit)
[19:35] Not that I know of
[19:40] dd0a13f37, i can't. the logs were made private because of a good reason i can't remember.
[19:41] chfoo: Can you tell me the password? I've been asking several times, and nobody knew what it was... :-|
[19:43] sent a pm
[19:50] SketchCow: You don't know you can download or that they keep it secret?
[19:51] *** schbirid has quit IRC (Quit: Leaving)
[19:54] *** kristian_ has joined #archiveteam-bs
[20:03] dd0a13f37: I'm not sure he can tell you, but if you're a researcher, they can get access to other things in person at IA- you'd need to email any requests of that nature and get them approved first though
[20:08] Okay, I see
[20:08] So if I upload something that will get darked, it's not "wasted" in the sense that it will take 70+ years for it to become available? In theory, that is
[20:19] *** icedice has quit IRC (Quit: Leaving)
[20:20] *** icedice has joined #archiveteam-bs
[20:21] dd0a13f37: you should realistically plan on never seeing something that was darked again, unless you have research credentials- that way you'll be pleasantly surprised in case it does turn up again
[20:21] So in other words, you have to archive the archives
[20:21] What about internetarchive.bak, won't they be getting a full collection?
[20:21] IA.BAK
[20:22] IA.BAK, okay
[20:22] Why did they change the name?
[20:22] sure- but generally darked items are spam or items that the copyright holder cares enough about to write in about- in which case, you should easily be able to get a copy elsewhere
[20:22] It's the same. We were writing at the same time.
[20:23] ah okay
[20:23] I'm wondering about the newspapers I'm archiving, since they're uploaded directly to IA and there's nobody that downloads and extracts PDFs, a copyright complaint could make them more or less permanently unavailable
[20:27] dd0a13f37: if there's an archive that's being sold or actively managed, I'd be wary of uploading, otherwise chances seem pretty good
[20:28] It's the subscriber-only section of a few Swedish newspapers, some of it is old stuff (eg last 100 years except last 20), some of it is new stuff (eg last 3 years)
[20:29] I've already uploaded it, I'm behind Tor so I don't care, but it would be a shame to have it all disappear into the void
[20:29] you've uploaded it, hopefully with excellent metadata, so you've done the best you can
[20:30] No, I could rent a cheap server and get the torrents if it's at risk of darking, but then I would have to fuck around with bitcoins
[20:30] The metadata is not included since it's all from archivebot
[20:43] *** dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
[21:09] *** mls has quit IRC (Ping timeout: 250 seconds)
[21:17] *** mls has joined #archiveteam-bs
[21:17] *** kristian_ has quit IRC (Quit: Leaving)
[21:52] *** Mateon1 has quit IRC (Remote host closed the connection)
[22:28] *** drumstick has joined #archiveteam-bs
[22:30] Anything that is darked *may* or *may not* be kept. No guarantees whatsoever, except that it isn't available for any random person to download.
[22:31] If you want to make sure something remains available, you need to keep a copy, and re-upload it to a new distribution channel
[22:31] if the existing ones decide to cease distributing it.
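For the "keep your own copy" advice above, a minimal sketch using the internetarchive Python client (pip install internetarchive) to mirror an uploaded item locally before it can be darked; the item identifier and destination directory are placeholders:

```python
from internetarchive import download

# Placeholder identifier: use the identifier of the item you uploaded.
download(
    "some-item-identifier",
    destdir="ia-backup",  # local directory to mirror the item into
    verbose=True,
)
```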