[00:02] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[00:41] *** BlueMaxim has joined #archiveteam-bs
[01:22] *** Soni has joined #archiveteam-bs
[02:40] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:41] *** dashcloud has joined #archiveteam-bs
[02:45] if IA doesn't have a copy of libgen then i'll eat my hat
[02:46] if IA has failed to snag a copy of libgen then we should really reconsider our life choices
[02:51] libgen?
[03:02] there was some discussion a few hours ago
[03:04] jrwr: libgen = library genesis
[03:04] Woooo
[03:06] Pretty sure this is the official site: http://gen.lib.rus.ec/
[03:06] However this is the site listed on wikipedia: https://libgen.pw/
[03:06] oh wait, that's one of them. DuckDuckGo was only showing one.
[03:08] On a side note, whenever we email site owners asking them to cooperate with us, I recommend that we send it to another email address first to see if it ends up in spam. That happened with the owner of imgh.us.
[03:12] Did you guys archive the-eye.eu?
[03:12] It has a lot of data though...
[03:21] We did not
[03:23] Are there plans to do so?
[03:24] About how much space does it take up?
[03:25] Well just the MSDN dump is 2.7TB
[03:25] eh
[03:25] Another dump of comics from whenever (pretty much all the major studios) is about 3TB
[03:25] Rom collection, not sure, pretty big I assume
[03:25] Then there is the reddit rips they have
[03:25] There is much stopping someone from uploading it to archive.org, maybe a mirror.
[03:25] *isn't
[03:26] What happens if I upload stuff, would the archive just delete it?
[03:26] I think it should be archived but you'll need to wait til the copyright expires :/
[03:27] second, I have a copy of it
[03:27] It's onyl like 8TB
[03:27] But most is not legal content
[03:27] yes
[03:27] and if it was going to be mirrored, archivist would do it
[03:27] Only
[03:27] archivist is the one who owns it
[03:28] Disclaimer: Most of us are not employed by archive.org.
[03:28] From what I've heard however, they wait until a copyright holder sends them a notice.
[03:28] Yeah, if he wanted it on IA it would be on IA
[03:28] we don't talk about copyright in here, folks
[03:28] take it to #scared-shitless
[03:28] we don't?
[03:28] haha
[03:28] or maybe /r/legaladvice
[03:28] #scared-shitless: Nick/channel is temporarily unavailable
[03:29] okay what's that tell you
[03:29] That there was a netsplit recently
[03:31] I searched for the word "copyright" in the logs: 475 matches in 213 files
[03:34] Lots of the stuff in the-eye appears to be porn. Still doesn't stop someone from attempting to upload it though.
[03:39] Didn't know that
[03:39] How is it "only" 8TB?
[03:44] *** arkhive has joined #archiveteam-bs
[04:25] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[04:42] *** kim__ has quit IRC (Ping timeout: 246 seconds)
[04:46] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:53] *** Sk1d has joined #archiveteam-bs
[04:54] Worth noting that IA standard procedure seems to be to dark an item instead of deleting it when a copyright claim is received
[05:03] Correct
[05:09] darking an item *may* mean that it's entirely gone, however. Or it may not. What it definitively means is that IA has ceased to *distribute* the item.
[05:36] *** Asparagir has quit IRC (Asparagir)
[06:15] *** Mateon1 has quit IRC (Remote host closed the connection)
[06:15] *** Mateon1 has joined #archiveteam-bs
[06:28] *** schbirid has joined #archiveteam-bs
[06:46] *** robink has quit IRC (Ping timeout: 246 seconds)
[06:46] *** robink has joined #archiveteam-bs
[06:49] i threw medium.com into wpull and it OOMd :D
[07:32] *** Honno has joined #archiveteam-bs
[08:06] *** Jonison has joined #archiveteam-bs
[08:24] *** BartoCH has joined #archiveteam-bs
[08:42] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[08:42] *** Mateon1 has joined #archiveteam-bs
[09:02] *** Jonison has quit IRC (Read error: Connection reset by peer)
[09:31] *** icedice has joined #archiveteam-bs
[09:31] *** icedice has quit IRC (Remote host closed the connection)
[09:31] *** etudier has joined #archiveteam-bs
[09:39] *** Jonison has joined #archiveteam-bs
[10:18] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[10:34] *** etudier has joined #archiveteam-bs
[10:53] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[11:21] *** pizzaiolo has joined #archiveteam-bs
[11:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:39] *** dashcloud has joined #archiveteam-bs
[11:51] *** mls has quit IRC (Ping timeout: 250 seconds)
[12:03] *** mls has joined #archiveteam-bs
[12:38] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:46] *** etudier has joined #archiveteam-bs
[12:52] *** plue has quit IRC (Quit: WeeChat 1.5)
[13:01] *** Jonison has quit IRC (Ping timeout: 260 seconds)
[13:01] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[13:04] *** etudier has joined #archiveteam-bs
[13:09] *** mls has quit IRC (Ping timeout: 250 seconds)
[13:29] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[13:30] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:40] It is not gone if dark'd.
[13:44] *** mls has joined #archiveteam-bs
[13:45] *** etudier has joined #archiveteam-bs
[13:50] *** dashcloud has joined #archiveteam-bs
[13:59] *** etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[14:10] *** etudier has joined #archiveteam-bs
[14:17] *** dd0a13f37 has joined #archiveteam-bs
[14:19] hook54321: The official site is libgen.io (or the IP, 94.something), gen.lib.rus.ec is an official mirror which only has the metadata
[14:20] second: In technological trouble, yes, but the operators will be fine. They have good opsec, have been doing this for 20 years, and the only one who isn't anonymous is a literal fugitive. They all live in the former soviet union too, so copyright is not a big problem there.
[14:20] libgen.pw, b-ok, and bookza are unofficial mirrors. sci-hub is a sister project run by the aforementioned fugitive using libgen as a storage backend and likely have backups too
[14:21] The torrents are also decently seeded from various residential russian IPs, and there are probably more who aren't seeding the torrents since it's storage-bound
[14:22] astrid: from what I can see, you're lacking a copy of sci-hub (aka sci-mag), which is really much more important than libgen
[14:25] TIL someone thought it'd be a good idea to name a parasitic wasp after Elbakyan.
[14:25] bit rude
[14:26] Yeah, that's what she said as well.
[14:26] But regarding the urgency of backing up Sci-Hub: I thought it's just a frontend to libgen? What additional data is there on SciHub?
[14:27] well, I doubt they'll all be arrested at the same time since they're different projects
[14:27] It's not that simple
[14:27] sci-hub uses libgen as a backend
[14:27] they have tons of "donated" accounts, and they cycle through them
[14:27] and download articles
[14:27] Scihub's articles are not in the main libgen collection
[14:28] libgen is separated into sci-tech (libgen), comics, paintings, russian fiction, foreign fiction, scimag
[14:28] Oh
[14:28] only sci-tech (libgen) is backed up afaik
[14:28] maybe foreignfiction/rus fict too
[14:28] Hm, I see.
[14:29] look at the library genesis forum if you're curious about how it works
[14:30] might be a good idea to use tor depending on where you live
[14:33] and the libgen collection on IA is not complete from what I can see, https://archive.org/details/gen-lib&tab=about was last updated in 2016
[14:33] *** drumstick has quit IRC (Read error: Operation timed out)
[14:35] Mar 2017 according to the graph, but keep in mind that this might not be the correct collection.
[14:36] That's the only one with any amount of activity
[14:36] Unless they store it under some other name
[14:36] or don't make it public at all
[14:37] *** Soni has quit IRC (Ping timeout: 250 seconds)
[14:45] that's not how you see if things have been uploaded to a collection
[14:45] there are items in that collection from 2 days ago
[14:45] How do you?
[14:46] I don't know if there is a public way
[14:46] Can you see what the name is? Is it something like 2092000?
[14:46] r_1727000
[14:47] sci-hub can probably afford backups, they currently have 67 btc (USD $270k) in their bitcoin wallet, and their expenses are around "a few thousand" a month
[14:47] that's an official torrent from 17-Aug-2017
[14:47] http://libgen.io/libgen/repository_torrent/
[14:47] Hey, remember the good times when I'd be able to answer Internet Archive questions helpfully
[14:47] Before edsu implied that the Internet Archive banned him?
[14:47] Those were good times.
[14:48] How's that #internetarchive channel doing, anyway, now that I can't go in there
[14:48] Would it be possible for you to add the later ones? It's as easy as downloading http://libgen.io/libgen/repository_torrent/r0-2092.ZIP and the last few ones, then deriving if I understand correctly
[14:48] Oh, and why does edsu have op on #archiveteam again
[14:49] When he wrote a whole essay with half-baked info about what the Internet Archive was going to do with robots.txt and got a wave of hatred?
[14:49] Not that I'm going to jeopardize my job and ban him, or anything
[14:49] r_2093000-r_2105000 from the site I linked
[14:49] Quick summary?
[14:49] Good times, good times
[14:50] Heyyyyyy the Ted Nelson scans are going beautifully, and the CD-ROM scanning has a faster workflow
[14:51] I paid $40 for a program that does nothing but crop
[14:51] But it crops well!
[14:51] do one thing and do it well
[14:51] :)
[14:52] This does the one thing very well.
[14:53] It's called "Batchcrop"
[14:54] I can say "OK, for the big pile of TIFFs I just scanned... crop away all the white part of the scan, with an X amount of pixels in all directions around the "content", and save it."
[14:54] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[14:54] So basically, I can just keep shoving CDs into my scanner, one at a time, and just scan them each into a directory.
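(Editorial aside on the cropping step described above.) The operation described for Batchcrop, trimming the white border while keeping a fixed margin around the content, can be sketched in a few lines. This is not how Batchcrop itself works, which the log does not say; it is a minimal illustration that assumes a grayscale image held as a list of rows of 0-255 values. The function names and the `white` threshold of 250 are hypothetical choices.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def content_box(pixels: List[List[int]], margin: int = 0,
                white: int = 250) -> Optional[Box]:
    """Bounding box of every pixel darker than `white`, grown by `margin`.

    `white` is an assumed threshold: values at or above it count as
    blank scanner background.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # Rows and columns that contain at least one non-white pixel.
    rows = [y for y, row in enumerate(pixels) if any(v < white for v in row)]
    cols = [x for x in range(width) if any(row[x] < white for row in pixels)]
    if not rows or not cols:
        return None  # the scan is entirely white
    # Grow the box by the margin, clamped to the image edges.
    left = max(cols[0] - margin, 0)
    top = max(rows[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, width)
    bottom = min(rows[-1] + 1 + margin, height)
    return (left, top, right, bottom)


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]
```

With an imaging library, the same box could be handed to that library's crop routine. The pure-Python form is used here only so the sketch has no dependencies.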
[14:55] *** pizzaiolo has joined #archiveteam-bs
[14:55] The longest part now is typing in names for the scans so they either match CD-ROMs I put up on archive with no scan, or match to rips I just did of same.
[14:55] Can't most image processing programs do that? Or does it have a sophisticated white space detection algo?
[14:55] I lent a guy some CDs to do this... 2 years ago
[14:55] He sheepishly brought the bin back to me last week.
[14:56] I scanned and cropped all 86 in about 1.5 hours
[14:56] I'm sure all image processing programs do something.
[14:56] They are many like it but this one is mine
[14:59] For the really hostile sites, what about using commercial proxy providers? I read about LJ on the wiki and they were apparently blacklisting your IPs
[15:01] >For this project, set it to 1, because LiveJournal tends to ban scrapers!
[15:10] >Since 2015, Sci-Hub has operated its own repository, distinct from LibGen
[15:11] If this is true (which I'm not sure of), then that might be why LG sci-mag torrents are unavailable
[15:16] Would it be possible for you to add the later ones?
[15:16] obviously somebody is actively working on it so there's no point in recruiting somebody else
[15:18] All you have to do is upload torrent and derive
[15:18] put that energy into archiving something that isn't famous
[15:18] and it's not actively maintained as far as I understand
[15:18] I am, I'm waiting on some email responses currently
[15:42] *** Mateon1 has quit IRC (Quit: Mateon1)
[15:43] *** Mateon1 has joined #archiveteam-bs
[16:21] http://www.bbc.com/news/uk-england-wiltshire-41267378
[16:44] *** odemg is now known as xbinwank
[16:46] *** Honno has quit IRC (Read error: Operation timed out)
[16:47] *** xbinwank is now known as odemg
[16:48] Anyone here speak/understand korean?
[17:31] *** Asparagir has joined #archiveteam-bs
[17:31] *** svchfoo3 sets mode: +o Asparagir
[17:31] *** svchfoo1 sets mode: +o Asparagir
[17:39] *** kristian_ has joined #archiveteam-bs
[17:50] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:50] *** dashcloud has joined #archiveteam-bs
[17:54] www.korean-books.com.kp/en/packages/xnps/download.pg.php?419 change "en" to "ko en fr sp de ru ch ja ar" to taste and 430 to any number <= 430
[17:54] What's the proper way to archive something like this? Do you need WARC for what's just a GET request returning a file?
[17:55] the proper way is to gin up a list of urls and submit to archivebot with !ao < http://url/yourlist.txt
[17:56] Thanks!
[17:57] then you can download the warc when the job is done and extract everything from it, if you're so inclined :)
[17:58] Okay, so what's the proper way when there's also metadata in XML and thumbnails? Parse separately or make script to rename them to their "real" names?
[17:59] hm?
[18:00] They're named like 00000412.pdf
[18:00] But they have names
[18:00] one second, site takes a bit to load
[18:01] They have names like "UNDERSTANDING KOREA (9) (HUMAN RIGHTS)"
[18:01] Also metadata
[18:01] "- Book on Common Sense -"
[18:01] "Foreign Languages Publishing House"
[18:01] "87 pp"
[18:02] and an image
[18:02] This won't be saved if you just have them archive a link list
[18:02] if you're motivated / have skills, the best way would probably be to upload each pdf as a separate IA book item with metadata
[18:02] well i'd add the xml files to the link list then
[18:02] *** fie has quit IRC (Read error: Operation timed out)
[18:03] It's not XML, you issue a POST request and get an entire page as HTML
[18:07] So you'd have to parse it
[18:07] I don't have the skills, and there are a few thousand
[18:12] ohh
[18:12] https://pastebin.com/parEbjPK this is what it looks like
[18:13] after parsing
[18:13] you send a base64 encoded json dict
[18:13] and get back a json dict
[18:13] containing the page html
[18:13] and it's escaped with backslashes two or three times
[18:14] that sounds like a delight
[18:14] check out their homemade CMS
[18:14] It's stateful, you set which language you want, it saves it server-side
[18:15] *** ReimuHaku has quit IRC (Ping timeout: 250 seconds)
[18:18] !ao < https://my.mixtape.moe/nsrkrj.txt
[18:18] like this?
[18:19] you need http:// at the front of your urls
[18:19] or https://
[18:19] or ftp://
[18:19] depending
[18:20] thanks
[18:21] !ao < https://my.mixtape.moe/tktryb.txt
[18:21] these uh
[18:21] aren't exactly pdfs
[18:22] They are
[18:22] Or does it not handle content disposition?
[18:22] they're pdfs with a sql statement at the front ???
[18:22] I can open them just fine
[18:22] hm
[18:23] maybe pdf doesn't mind about that
[18:23] *** ReimuHaku has joined #archiveteam-bs
[18:24] they seem to all start with
[18:24] oh yeah I see
[18:24] Update PublicationList_ko Set pVisitCount="2" Where pId=127%PDF-1.4
[18:24] it's uh
[18:24] sqli
[18:24] nice job folks
[18:24] There are numerous other vulnerabilities too
[18:25] figures
[18:25] There's an undocumented way to register an account on KCNA
[18:25] okay, well, go ahead and submit that job in #archivebot
[18:25] which appears to do nothing
[18:25] but it actually registers you
[18:25] lol
[18:25] and you can log in
[18:25] and the only thing it does
[18:25] is add some tracking code
[18:25] you don't even show up as logged in
[18:25] there is also a random zip file serving malware
[18:44] Well, I can't get it to work. Any pointers? It needs a timeout of maybe 5 minutes for the first request, then some IP whitelisting or something happens
[18:44] So just forcing IA to do a request would be fine
[18:44] IA doesn't run archivebot :)
[18:45] Does it use IA IPs?
[18:45] no
[18:45] we run archivebot
[18:45] Does it share an IP with anything else?
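(Editorial aside on the link list submitted with `!ao <` above.) The list can be generated mechanically from the URL pattern quoted at 17:54, with the scheme prefix that was pointed out as missing at 18:19. The sketch below makes several assumptions the log does not confirm: that IDs start at 1, that every language uses the same ID range up to 430, and that plain `http://` is the right scheme. The function and file names are hypothetical.

```python
from typing import Iterable, List

# URL pattern taken from the 17:54 message, with an explicit scheme added.
BASE = "http://www.korean-books.com.kp/{lang}/packages/xnps/download.pg.php?{n}"
# Language codes as listed in the log.
LANGS = ["ko", "en", "fr", "sp", "de", "ru", "ch", "ja", "ar"]


def build_urls(langs: Iterable[str] = LANGS, first: int = 1,
               last: int = 430) -> List[str]:
    """One URL per (language, id) pair; the id range is inclusive.

    Assumption: ids run from `first` to `last` for every language.
    """
    return [BASE.format(lang=lang, n=n)
            for lang in langs
            for n in range(first, last + 1)]


def write_list(path: str, urls: Iterable[str]) -> None:
    """Write one URL per line, the format a plain-text link list uses."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
```

A call such as `write_list("yourlist.txt", build_urls())` would produce a file that can be hosted somewhere and passed to `!ao <`. As noted in the log, this captures only the PDF downloads, not the titles and metadata served through the POST interface.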
[18:45] it's a bunch of machines, run by several people in this channel
[18:45] generally they have dedicated IPs, but multiple grabbers run per host
[18:46] Do you run one of them? Can you force it to use a certain machine?
[18:46] yes and yes
[18:46] Do you have SSH access/similar?
[18:46] i wasn't getting that whitelisting effect, btw
[18:46] it may be that you've got browser keepalives going on
[18:47] Nope, they do connection:close
[18:48] I might be mistaking it for something else, but wget takes a long time (minutes) if it even does it
[18:48] and ff is instant
[18:49] archivebot is more similar to wget than to firefox
[18:49] yeah
[18:49] oh, apparently I have a phpsessid
[18:49] well that explains it
[18:50] "Apache/2.2.15 (RedStar 3.0)", how does it even work
[18:50] Does it just randomly time out requests?
[18:51] I managed to get one with wget now, connecting took 20 seconds and downloading 2m:20s (at 15 kbit)
[19:01] *** zyphlar has joined #archiveteam-bs
[19:09] Well, I can't wrap my head around north korean web magic
[19:10] *** dd0a13f37 has left
[19:48] https://www.eff.org/deeplinks/2017/09/open-letter-w3c-director-ceo-team-and-membership "Effective today, EFF is resigning from the W3C."
[19:49] o_O
[19:50] Wow
[19:51] Ah, the DRM bullshit, right.
[20:11] *** schbirid has quit IRC (Quit: Leaving)
[20:20] holy crap
[20:22] I imagine this event will be a bit different now. https://twitter.com/internetarchive/status/909868291249684480
[20:27] *** kim_ has joined #archiveteam-bs
[20:48] *** Dark_Star has quit IRC (Remote host closed the connection)
[21:11] *** zyphlar has quit IRC (Quit: Connection closed for inactivity)
[21:29] *** Darkstar has joined #archiveteam-bs
[21:43] *** noirscape has quit IRC (Read error: Operation timed out)
[21:43] *** zino has quit IRC (Quit: Leaving)
[21:46] https://www.youtube.com/watch?v=h94ZKGVg-B8
[21:46] I think we should post something about this on the ArchiveTeam twitter account.
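(Editorial aside on the downloaded files discussed at 18:21-18:24.) The files begin with an SQL statement and only then the `%PDF-1.4` header. Should clean copies be wanted after extraction from the WARC, the leading bytes can be dropped by searching for the header marker. This is a minimal sketch, with a hypothetical function name, that assumes the first `%PDF-` in the file is the real header. The archived WARC itself would normally be left untouched, since it records what the server actually sent.

```python
def strip_pdf_prefix(data: bytes) -> bytes:
    """Return `data` starting at the first %PDF- marker.

    Assumption: anything before the first marker is junk injected by
    the server (here, the echoed SQL statement), not part of the file.
    """
    start = data.find(b"%PDF-")
    if start == -1:
        raise ValueError("no %PDF- header found")
    return data[start:]
```

Usage would look like reading a file in binary mode, passing the bytes through `strip_pdf_prefix`, and writing the result to a new file rather than overwriting the original.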
[21:56] who wants to start building rpi librarybox boxies?
[21:59] *** BlueMaxim has joined #archiveteam-bs
[22:00] *** balrog has quit IRC (Read error: Operation timed out)
[22:00] *** JAA has quit IRC (Read error: Operation timed out)
[22:00] *** C4K3 has quit IRC (Read error: Operation timed out)
[22:00] *** ruunyan has quit IRC (Read error: Operation timed out)
[22:00] *** squires has quit IRC (Read error: Operation timed out)
[22:00] *** ZexaronS has quit IRC (Read error: Operation timed out)
[22:01] *** rocode has quit IRC (Read error: Operation timed out)
[22:01] *** ZexaronS has joined #archiveteam-bs
[22:02] *** JAA has joined #archiveteam-bs
[22:02] *** swebb sets mode: +o JAA
[22:02] *** wp494 has quit IRC (Read error: Operation timed out)
[22:02] *** squires has joined #archiveteam-bs
[22:02] *** balrog has joined #archiveteam-bs
[22:02] *** swebb sets mode: +o balrog
[22:03] *** REiN^ has quit IRC (Write error: Broken pipe)
[22:03] *** wp494 has joined #archiveteam-bs
[22:03] *** PotcFdk has quit IRC (Write error: Broken pipe)
[22:04] *** ruunyan has joined #archiveteam-bs
[22:05] *** REiN^ has joined #archiveteam-bs
[22:05] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[22:06] *** tfgbd_znc has joined #archiveteam-bs
[22:06] *** rocode has joined #archiveteam-bs
[22:07] *** drumstick has joined #archiveteam-bs
[22:07] *** C4K3 has joined #archiveteam-bs
[22:11] *** PotcFdk has joined #archiveteam-bs
[22:15] *** wabu has quit IRC (Ping timeout: 246 seconds)
[22:20] *** ola_norsk has joined #archiveteam-bs
[22:21] godane: What are those?
[22:21] is posting links possible?
[22:21] yes definitely
[22:21] ok, one sec
[22:22] https://pbs.twimg.com/media/DKCW8SnWkAIgpqn.jpg:large
[22:22] that is the result of the attempt
[22:22] but, let me get the url to the tweet status, so you dont need to retype it from image
[22:23] https://twitter.com/JeffHollandaise/status/897970096429084672
[22:24] hm, for some reason twitter has decided that you're coming from germany
[22:24] I can view this url..but can't archive it. When i try, i get german twitter
[22:24] i'm not sure how it decides that
[22:24] probably the source IP that archive.org is using looks like a german IP
[22:25] yes, it's not me
[22:25] you are more than welcome to join #archivebot and do
[22:25] !ao https://twitter.com/JeffHollandaise/status/897970096429084672 --ignore-sets=twitter
[22:25] er, also the --phantomjs option
[22:25] ill check it out. ty
[22:26] *** wabu has joined #archiveteam-bs
[22:26] but, i have to ask..what difference would it really make?
[22:26] archivebot is run by us, and i haven't seen any german redirects affecting it
[22:26] (archiveteam is not the same as archive.org, we have completely different infrastructure)
[22:27] i mean looking like a german IP, would there be any difference in it working or not?
[22:27] oh
[22:27] uhhh, it shouldn't redirect
[22:27] what do you want to happen exactly?
[22:27] the --phantomjs option will pull in the css and images and javascript so it'll look and work correctly
[22:28] doing it with archivebot will make sure it gets run from a jurisdiction where twitter won't screen out nazi imagery
[22:28] i would expect waybackmachine to archive like regular
[22:28] ok
[22:28] wayback machine's liveweb feature usually works well but sometimes has some issues
[22:28] twitter is a difficult website to archive
[22:29] i've had no problem so far i think
[22:29] hm okay
[22:29] maybe it's because the tweet has nazi imagery in it, i know they filter that sort of thing out in some places
[22:30] *** atluxity has quit IRC (Ping timeout: 506 seconds)
[22:30] so in german twitter, nazi imagery (i havent looked close to see if there was any), is screened?
[22:30] sometimes?
[22:30] it's not clear
[22:30] i mean there is some nazi/kkk shit in that tweet
[22:31] ok
[22:32] anyway, thanks for the help. I can't stand nazism myself, but this was really frustrating
[22:33] i'm not a fan either ...
[22:33] yeah
[22:35] but yeah. #archivebot is a channel on this network where we operate an irc bot that lets you submit links for archival
[22:38] btw, i also tried previously to archive my own https://pbs.twimg.com/media/DKCYoW-W4AAsH_T.jpg:large
[22:39] and i can't see how i'm pegged as a nazi
[22:40] well, looks like my home "server" build completes faster than I expected. scored a nice (imho) motherboard from the same seller I got the i3-2120 from. intel's dq67ow
[22:40] ola_norsk: maybe hm maybe actually, that looks like archive.org's ip space has been blocked from using twitter without logging in
[22:41] *** bluesoul has quit IRC (Read error: Operation timed out)
[22:41] :(
[22:41] haven't had much experience with Intel's boards in the past other than the DH77EB in my mother's HTPC which actually has been rock solid for the past 4+ years. and this thing was 32€, including shipping. not too shabby
[22:41] *** svchfoo1 has quit IRC (Remote host closed the connection)
[22:41] *** bluesoul has joined #archiveteam-bs
[22:41] Lagittaja: i think that's completely offtopic for this channel
[22:42] *** svchfoo1 has joined #archiveteam-bs
[22:42] well sorry astrid, I have been having a conversation about this build with another person on this channel and I intend to use it to put more horse power for archiving. so sorry I'll see myself out
[22:43] *** Lagittaja has quit IRC (Quit: Leaving)
[22:43] *** svchfoo3 sets mode: +o svchfoo1
[22:48] ah, sorry, i didn't know
[22:54] *** kristian_ has quit IRC (Quit: Leaving)
[22:54] *** ola_norsk has left
[23:17] *** BartoCH has quit IRC (Quit: WeeChat 1.9)
[23:18] hook54321: i'm working on a project to add kiwix to slackwarearm 14.2
[23:19] https://archive.org/details/slackwarearm-14.2-20170906-kiwix
[23:22] *** drumstick has quit IRC (Read error: Operation timed out)
[23:23] https://twitter.com/xor/status/909888462584795136
[23:24] i now just need to write a script to mount /dev/sda2 and look for something like /mnt/data/kiwix for all the kiwix files
[23:24] i have another script to build the library.xml file in /mnt/data/kiwix folder
[23:25] then it's kiwix --library $path/library.xml --port 8000 --daemon something
[23:33] *** Soni has joined #archiveteam-bs
[23:48] *** fie has joined #archiveteam-bs
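(Editorial aside on the kiwix startup script described at 23:24-23:25.) The two non-privileged parts of that plan, finding the kiwix files under the data directory and assembling the server command line, might be sketched as below. This is an illustration under stated assumptions, not the script from the log: it assumes the content files use the `.zim` extension, that the binary meant by "kiwix" is `kiwix-serve`, and that `library.xml` sits directly in the data directory. The mount of /dev/sda2 is left out because it needs root, and the command is only built here, not run.

```python
from pathlib import Path
from typing import List


def find_zim_files(data_dir: str) -> List[Path]:
    """All .zim files under `data_dir`, searched recursively and sorted.

    Assumption: the "kiwix files" mentioned in the log are ZIM archives.
    """
    return sorted(p for p in Path(data_dir).rglob("*.zim") if p.is_file())


def serve_command(data_dir: str, port: int = 8000,
                  program: str = "kiwix-serve") -> List[str]:
    """Argument list mirroring the command quoted at 23:25.

    `program` defaults to kiwix-serve as an assumption; the log only
    says "kiwix".
    """
    library = Path(data_dir) / "library.xml"
    return [program, "--library", str(library), "--port", str(port), "--daemon"]
```

For the path in the log, `serve_command("/mnt/data/kiwix")` yields an argument list that could be passed to `subprocess.run` once the partition is mounted and the library file exists.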