[00:01] *** BlueMaxim has joined #archiveteam-bs
[00:23] *** pizzaiolo has joined #archiveteam-bs
[00:25] *** pizzaiolo has quit IRC (Client Quit)
[00:42] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[00:43] *** BlueMaxim has joined #archiveteam-bs
[02:31] *** pnJay has quit IRC (Leaving)
[02:32] *** sep332_ has quit IRC (Read error: Operation timed out)
[02:51] *** pizzaiol1 has left
[02:59] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[03:01] *** dashcloud has joined #archiveteam-bs
[04:01] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:02] *** BlueMaxim has joined #archiveteam-bs
[04:27] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:34] *** Sk1d has joined #archiveteam-bs
[06:27] *** wickedpla is now known as wp494
[06:33] *** DudesonMc has joined #archiveteam-bs
[06:53] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[06:53] *** kniffy has quit IRC (Read error: Operation timed out)
[06:55] *** Stilett0 has joined #archiveteam-bs
[07:07] *** kniffy has joined #archiveteam-bs
[07:07] *** Jonison has joined #archiveteam-bs
[07:12] *** GE has joined #archiveteam-bs
[07:23] *** CHRONO is now known as notabot
[07:23] *** notabot is now known as chrono
[07:44] *** chrono is now known as CHRONO
[07:58] *** schbirid has joined #archiveteam-bs
[08:02] *** GE has quit IRC (Remote host closed the connection)
[08:13] *** GE has joined #archiveteam-bs
[08:29] *** wp494 has quit IRC (Read error: Connection reset by peer)
[08:36] *** GE has quit IRC (Remote host closed the connection)
[08:39] *** CHRONO has quit IRC (Quit: ZNC 1.6.3+deb1 - http://znc.in)
[08:39] *** chrono- has joined #archiveteam-bs
[08:42] *** chrono- is now known as chrono
[08:42] *** chrono is now known as SENDQ
[08:46] *** SENDQ is now known as CHRONO
[08:51] *** Stilett0 has quit IRC (Read error: Operation timed out)
[08:53] *** Stilett0 has joined #archiveteam-bs
[09:09] *** johtso has joined #archiveteam-bs
[09:14] *** DudesonMc has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[10:35] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:46] *** BartoCH has quit IRC (Remote host closed the connection)
[10:50] *** bsmith093 has quit IRC (Ping timeout: 260 seconds)
[10:52] SketchCow: i'm uploading some old ezboard i grabbed from kbskorea
[10:52] https://archive.org/details/kbskorea.net-bbs-ezboard-k_chuncheontv1-20151216
[10:54] this a full list of ones i got in the past: https://archive.org/search.php?query=subject%3A%22kbskorea.net%22&sort=-publicdate
[11:23] *** fie has joined #archiveteam-bs
[11:53] *** BartoCH has joined #archiveteam-bs
[12:28] *** Lord_Nigh has quit IRC (Ping timeout: 250 seconds)
[12:58] *** RichardG has quit IRC (Read error: Operation timed out)
[13:03] *** Lord_Nigh has joined #archiveteam-bs
[13:04] *** Lord_Nigh has quit IRC (Excess Flood)
[13:04] *** Lord_Nigh has joined #archiveteam-bs
[13:12] *** midas2 has joined #archiveteam-bs
[13:13] *** midas has quit IRC (Ping timeout: 244 seconds)
[13:15] *** Jonison2 has joined #archiveteam-bs
[13:18] *** Jonison has quit IRC (Ping timeout: 260 seconds)
[13:20] *** midas2 is now known as midas
[13:24] *** Jonison has joined #archiveteam-bs
[13:24] *** Jonison has quit IRC (Read error: Connection reset by peer)
[13:27] *** Jonison2 has quit IRC (Ping timeout: 260 seconds)
[13:32] *** pizzaiolo has joined #archiveteam-bs
[15:11] *** RichardG has joined #archiveteam-bs
[15:50] *** bsmith093 has joined #archiveteam-bs
[16:08] *** Petri152 has quit IRC (Read error: Operation timed out)
[16:18] *** Petri152 has joined #archiveteam-bs
[16:39] *** zhongfu has quit IRC (Remote host closed the connection)
[16:40] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[16:41] *** zhongfu has joined #archiveteam-bs
[17:04] *** Lord_Nigh has joined #archiveteam-bs
[17:11] *** wp494 has joined #archiveteam-bs
[17:15] *** JAA has joined #archiveteam-bs
[17:22] *** Pudsey has joined #archiveteam-bs
[17:22] *** odemg has joined #archiveteam-bs
[17:26] *** Pudsey has quit IRC (Remote host closed the connection)
[17:27] *** cf has quit IRC (Ping timeout: 260 seconds)
[17:42] *** cf has joined #archiveteam-bs
[17:43] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[17:44] *** fie has quit IRC (Read error: Operation timed out)
[17:57] *** GE has joined #archiveteam-bs
[17:59] *** fie has joined #archiveteam-bs
[18:23] *** Lord_Nigh has joined #archiveteam-bs
[18:48] *** mls has quit IRC (Ping timeout: 250 seconds)
[18:54] anyone got anything on a gbit link with ipv6? france/eu if possible, would like to do a quick iperf test
[19:08] *** JAA_ has joined #archiveteam-bs
[19:11] *** JAA has quit IRC (Ping timeout: 268 seconds)
[19:11] *** bwn has quit IRC (Read error: Connection reset by peer)
[19:12] *** mls has joined #archiveteam-bs
[19:30] *** bwn has joined #archiveteam-bs
[19:31] *** GE has quit IRC (Remote host closed the connection)
[19:50] *** odemg has quit IRC (Remote host closed the connection)
[19:51] *** odemg has joined #archiveteam-bs
[19:51] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[19:52] *** odemg2 has joined #archiveteam-bs
[19:52] *** odemg2 has quit IRC (Connection closed)
[19:53] *** odemg2 has joined #archiveteam-bs
[19:53] *** woktenna has joined #archiveteam-bs
[19:56] *** odemg has quit IRC (Ping timeout: 245 seconds)
[19:56] Guys... could something be done about domain parker?
[19:56] They run nginx on canonical domain... serving robots.txt
[19:57] And autoredirect to ww1.example.com
[19:57] Which CNAMEs to various parking teams
[19:58] Sorry, I know you are archivists, not retrievers...
[19:58] But just in case you know... please tell
[20:01] *** JAA_ is now known as JAA
[20:02] *** odemg2 has quit IRC (Read error: Operation timed out)
[20:03] *** odemg has joined #archiveteam-bs
[20:06] *** pizzaiolo has joined #archiveteam-bs
[20:12] Try yourself: curl http://survey-winner.net/robots.txt
[20:13] And curl http://survey-winner.net/ 302s to curl http://ww1.survey-winner.net/
[20:25] well what would you want to do about it?
[20:33] cammon
[20:33] I would want to access the Archive!
[20:34] ?
[20:35] Look, the guys behind the http://survey-winner.net/ has set up an nginx
[20:35] on multiple IP addresses
[20:35] Thousands of domains resolve to those IP addresses
[20:36] Those are long-ago _expired_ domains, which previously belonged to old websites
[20:37] But they hold them as hostages
[20:37] So they could (presumably) make money on domain parking
[20:38] To clear this a bit:
[20:38] They do not provide domain parking themselves
[20:39] They just set up a server to redirect
[20:39] to ww1.*whatever*, which CNAMEs to actual domain parkers
[20:40] BUT
[20:41] www.survey-winner.net or whatever points to a stub webserver
[20:41] which unfortunately hosts robots.txt
[20:41] Any suggestions?
[20:46] @schbirid, I wonder if you addressed me in particular
[20:46] Sorry if I jumped the conversation
[20:46] you are describing domain squatting
[20:46] *** icedice has joined #archiveteam-bs
[20:46] Sort of
[20:46] but not what your problem is that you want to solve
[20:48] @schbirid Web Archive _prohibits_ browsing of websites with robots.txt
[20:48] ah, we are not the Internet Archive
[20:48] and yes that is a known and well disliked feature
[20:48] Look, I already said
[20:48] > Sorry, I know you are archivists, not retrievers...
[20:49] But just in case you know... please tell
[20:49] no way around it
[20:49] :}
[20:49] err -> :\
[20:55] hm, my corpweb proxy is smart enough to block http://web.archive.org/web/*/http://survey-winner.net/ under the survey-winner.net block
[20:55] in other news thanks for making me trip my corporate web proxy, woktenna
[20:55] Try changing to https:*
[20:55] -_-
[20:56] yes that's fine
[20:56] *** GE has joined #archiveteam-bs
[20:56] still yet another log of me doing something that's not work
[21:01] xmc: HTTPS blocked?
[21:02] no https works fine but the "view this site's robots.txt" link goes to a plaintext link
[21:02] Nvm. Can't read
[21:02] on the target domain
[21:03] The http://survey-winner.net/ is nothing but another example of this practice
[21:04] It's peculiar though
[21:04] there's no history of it in the archive prior to its domain squatting
[21:04] Because if you try to access it with changed 'Host:' header
[21:05] The webserver will still point to http://ww1.survey-winner.net/
[21:05] In other words
[21:05] It is a stub in case no such domain is in their database
[21:06] I will come up with another domain, wait a mo
[21:07] Try curl http://1papercraft.com/robots.txt
[21:07] It's the same people
[21:07] ah
[21:07] IPs are different, though
[21:08] But their webserver config is the same
[21:11] No point in enumerating all the domains
[21:13] Many are just spam, some are priceless (belonged to websites in past)
[21:22] If you want to look further, I used www.robtex.com to reverse IPs to domains
[21:22] Try this: 51.254.28.162
[21:29] *** schbirid has quit IRC (Quit: Leaving)
[21:34] To be precise: how can we convince them to add the exception to robots.txt?
[21:35] I'm fine with their profits on expired domains
[21:35] Not all evil could be rooted today
[21:37] squatters will not listen to you. ia is doing something about it, slowly.
[21:38] @xmc Are you with IA?
[21:38] no
[21:38] How could you know then?
[21:38] because i talk to people who are
[21:51] Is it possible to find admin of those webservers?
[21:52] Chances are the squatters outsource their operation
[21:52] And only point their A records
[22:26] *** kristian_ has joined #archiveteam-bs
[22:40] *** GE has quit IRC (Remote host closed the connection)
[22:50] Update on InterfaceLIFT: I now have a functional wpull hook script which retrieves all sensible resources not accessible directly (images in all resolutions and the portfolio/submission browsers).
[22:50] Note that ArchiveBot did pick up images in some resolutions, but I'm almost certain it'll only be able to find about half of them; it'll also miss the portfolio and submission browsers (which are actually pretty redundant but still nice to have for a fully functional archive; they won't work in the Wayback Machine though).
[22:55] Unfortunately, based on a very rough estimate, the full archive will be several hundred GB, which is more space than I have available currently. If anyone of you wants to run it or has other suggestions, let me know.
[22:55] tammy_: ^
[22:58] JAA: If you're willing to help me, I'll run it. I storage for days.
[22:58] *I got
[23:01] tammy_: Sure. Do you have a functioning wpull?
[23:02] Version 1.2.3, that is.
[23:03] nope, never even heard of it
[23:04] I stick to wget
[23:05] I see. Do you have Python and pip?
[23:06] I can acquire anything. in fact I'm standing up a new VM for this
[23:08] single core good enough?
[23:11] I guess so, yeah. The limiting factor is time (not overloading the server) and network anyway.
[23:12] You'll need Python 3.2+ (including the dev headers) and pip. Which OS are you using?
[23:14] I'm gigabit, if need be, but I'd rather work through my vpn server. I don't mind cutting over to my personal network if time requires it.
[23:14] Debian 8.7
[23:15] I can stand up something different if that's an issue
[23:16] It's the server's network which is slow. Gbit or 10 Mbit probably doesn't make any difference.
[23:16] Debian's perfect. :-)
[23:20] So the required Python packages are python3 and python3-dev. If you want to install pip system-wide, python3-pip; I normally install it per-user on my machines using https://bootstrap.pypa.io/get-pip.py (wget, then python3 get-pip.py --user).
[23:21] I am root, I'll just apt it :)
[23:21] Ok
[23:22] Then: pip install html5lib==0.9999999 (wpull hasn't been updated to deal with the newest version, and the dependencies haven't been fixed either...)
[23:22] Followed by: pip install wpull==1.2.3 psutil (I think everything else gets pulled automatically)
[23:23] Add a --user flag if you want to do that in the user's directory instead.
[23:29] Here's the hook script and the wpull command I used for testing: https://gist.github.com/anonymous/c752b52901d6688d8b677e759c694896
[23:53] *** BlueMaxim has joined #archiveteam-bs
[23:57] *** WIDOW has joined #archiveteam-bs
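
Note: the setup steps discussed between 23:12 and 23:23 can be collected into one place. The following is only a minimal sketch under stated assumptions, not a tested recipe: it assumes Debian 8 run as root with the system-wide pip route chosen at 23:21, it assumes the pip command for Python 3 is called pip3 on that system (the log only says "pip"), and wpull_setup_plan is a hypothetical helper name. It prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch of the setup steps from the log (assumed: Debian 8, run as root).
# wpull_setup_plan is a hypothetical helper; it only prints the commands.
wpull_setup_plan() {
    # Python 3 with dev headers and system-wide pip, as named at 23:20.
    echo "apt-get install -y python3 python3-dev python3-pip"
    # Pin html5lib before installing wpull (23:22).
    # pip3 is an assumption; the log just says "pip".
    echo "pip3 install html5lib==0.9999999"
    # wpull 1.2.3 plus psutil (23:22).
    echo "pip3 install wpull==1.2.3 psutil"
}

# Print the plan; pipe the output into sh to actually run it.
wpull_setup_plan
```

For a per-user install, append --user to both pip commands, as mentioned at 23:23.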