[00:01] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[00:37] *** BlueMax has joined #archiveteam-bs
[01:46] *** purplebot has quit IRC (Read error: Operation timed out)
[01:46] *** PurpleSym has quit IRC (Read error: Operation timed out)
[01:53] *** purplebot has joined #archiveteam-bs
[01:54] *** PurpleSym has joined #archiveteam-bs
[02:14] *** Raccoon has quit IRC (Remote host closed the connection)
[02:14] *** Raccoon has joined #archiveteam-bs
[02:19] *** odemg has quit IRC (Ping timeout: 260 seconds)
[02:32] *** odemg has joined #archiveteam-bs
[03:22] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:34] *** odemg has joined #archiveteam-bs
[04:12] *** Arctic has joined #archiveteam-bs
[04:14] [2018-08-28 04:13:08] We should probably archive http://hiddenpalace.org/ as it contains a lot of significant prototypes of games and Nintendo as of late has been on a high-profile crusade against ROMS.
[04:15] We likely won't get the ROMS but will get the pages relating to the ROMS
[04:15] put it in the archive bot because Mwuhahahahaha
[04:15] The roms are the important part
[04:15] here
[04:25] Kiska the roms are the important part here in my opinion
[04:25] however it would probably be best if someone grabbed them and uploaded them to the archive rather than through the wayback macgine
[04:26] ROMs are important if they can't be found elsewhere
[04:26] These ones are
[04:27] Alright let's see how archivebot handles the downloads
[04:27] they are literally the only places you can find 90% of this stuff
[04:27] Should be easy since the download links are in the plain
[04:28] I don't see any JavaScript wrapping it
[04:47] Thanks.
[04:51] kiska: So we're using ArchiveBot to archive the site and ROMs?
[05:16] i'm seeing about making a hiddenpalace.org warc
[05:16] this cause i'm going to try to extract the rom images urls from it
[05:16] then we have a list that i can do a !ao < pastebin of it
[05:22] Alright.
Where is it going to be hosted?
[05:25] The internet archive if things all go right
[05:28] Sounds good.
[05:28] Wayback Machine?
[05:28] *** chferfa has quit IRC ()
[05:30] Depends
[06:06] godane I would not do that since archivebot should get the roms as well as everything it has on the pages unless there is severe JavaScript obscurification
[06:07] And also do it after we finish I'm the archivebot job since I am going to assume it will tax their server
[07:00] *** Arctic has quit IRC (Quit: Page closed)
[07:56] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[07:56] *** Mateon1 has joined #archiveteam-bs
[09:24] *** purplebot has quit IRC (Remote host closed the connection)
[09:24] *** PurpleSym has quit IRC (Quit: *)
[09:25] *** PurpleSym has joined #archiveteam-bs
[09:35] *** caff has quit IRC (Read error: Connection reset by peer)
[10:41] *** bitBaron has joined #archiveteam-bs
[10:49] *** bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…)
[10:50] *** odemg has quit IRC (Ping timeout: 260 seconds)
[11:01] What is the general experience with server load caused by archiving? I am used to thinking of web crawling as a "drop in the ocean" kind of load but I hear/see a fair bit of concern about server load and rate limiting so I wonder if my impression is inaccurate.
[11:02] *** odemg has joined #archiveteam-bs
[11:02] *** BlueMaxim has joined #archiveteam-bs
[11:03] We are gonna be taxing their servers with >2 connections per 200 ms. Which might strain their connection
[11:04] For a warrior project, we might be hitting the server with >1000 connections so if their pipeline is not big enough then, its gonna over their pipeline.
If processing power is insufficient then we are going to get error code 500s or some other code to tell us we are overloading their server
[11:07] overload*
[11:07] *** BlueMax has quit IRC (Read error: Operation timed out)
[11:26] *** zino has quit IRC (Remote host closed the connection)
[11:27] *** zino has joined #archiveteam-bs
[11:29] Is that site RIP now? I can't connect from here, might be firewall though...
[11:31] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:34] Tom's Hardware has been going through some internal struggles while switching from being a hardcore PC tech site to being a SEO driven click bait farm. They are now also apparently being sold.
[11:34] I don't know how well their stuff is already covered in the archive, so it might be worth having a look at if someone else has some time.
[11:36] They have 379 videos on their Youtube channel that are probably not covered by archives.
[11:45] Oh great. http://www.tomshardware.com/ => "This URL has been excluded from the Wayback Machine."
[11:45] The German version seems to be crawled every week though.
[12:03] faolingfa: Depends a lot on the website, obviously. It's often more a matter of the sites rate-limiting us even if they have the resources to serve us simply because the sysadmin configured it that way. Another concern in some cases is the amount of traffic caused; small sites might have low traffic caps.
[12:04] chr1sm: hiddenpalace.org works fine here.
[12:30] zino tom's youtube?
I am going to chuck it into tubeup
[13:06] *** bitBaron has joined #archiveteam-bs
[13:11] *** kiska has quit IRC (Remote host closed the connection)
[13:11] *** kiskabak2 has quit IRC (Remote host closed the connection)
[13:11] *** Flashfire has quit IRC (Remote host closed the connection)
[13:11] *** kiskaBak has quit IRC (Remote host closed the connection)
[13:12] *** kiska has joined #archiveteam-bs
[13:12] *** kiskabak2 has joined #archiveteam-bs
[13:13] *** Flashfire has joined #archiveteam-bs
[13:13] *** w0rmhole has joined #archiveteam-bs
[13:13] *** kiskaBak has joined #archiveteam-bs
[13:16] *** bitBaron has quit IRC (Ping timeout: 480 seconds)
[14:04] *** Pixi has quit IRC (Quit: Pixi)
[14:07] *** Pixi has joined #archiveteam-bs
[14:32] kiska, https://www.youtube.com/user/TomsHardware
[14:38] *** schbirid has joined #archiveteam-bs
[14:39] Ok I threw it into tubeup
[14:50] \o/
[14:55] *** wp494 has quit IRC (Read error: Operation timed out)
[14:55] *** wp494 has joined #archiveteam-bs
[15:08] *** Muad-Dib has joined #archiveteam-bs
[15:09] *** svchfoo1 sets mode: +o Muad-Dib
[15:42] I am trying to familiarize myself with wpull. Is this up to date? https://wpull.readthedocs.io/en/master/install.html I ask because I get some fairly abrupt error right after installing it ("successfully") via pip: ImportError: cannot import name 'SSLCertificateError'
[15:46] faolingfa: You need Tornado 4.x, not 5.x. And also html5lib==0.9999999, not a higher version.
[15:46] You'll also want to use either wpull 1.2.3 or FalconK's (or my) fork. Version 2.0.1 is very unstable and hardly usable.
[15:49] Oh man
[15:49] Where's an exe file when you need one
[15:56] *** offline_c has joined #archiveteam-bs
[15:58] https://launchpad.net/wpull/+download oh, here is an exe file! :)
[15:58] Oh yeah, that weird binary. Caused plenty of strange errors over at Newsgrabber before.
[16:00] How about using a proper OS? ;-)
[16:03] J.
Kenji López-Alt a food blogger is closing his facebook page next week. See: https://m.facebook.com/story.php?story_fbid=1248348595307553&id=630532740422478
[16:03] Is there a good way to back that up before it's lost?
[16:05] offline_c: There is no really "good" way, but I'm scraping it for posts now and will throw those into ArchiveBot later. Better than nothing at least. Won't grab all the comments etc. though.
[16:09] JAA: thanks.
[16:19] *** bitBaron has joined #archiveteam-bs
[16:27] *** offline_c has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[16:54] *** zino has quit IRC (Remote host closed the connection)
[16:54] *** zino has joined #archiveteam-bs
[16:58] *** chferfa has joined #archiveteam-bs
[17:56] *** caff has joined #archiveteam-bs
[18:20] *** caff_ has joined #archiveteam-bs
[18:27] *** caff has quit IRC (Read error: Operation timed out)
[19:08] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[19:17] *** bitBaron has joined #archiveteam-bs
[19:39] *** Mateon1 has quit IRC (Remote host closed the connection)
[19:46] *** Raccoon has quit IRC (Remote host closed the connection)
[19:46] *** Raccoon has joined #archiveteam-bs
[19:48] *** Mateon1 has joined #archiveteam-bs
[20:08] *** ndiddy has joined #archiveteam-bs
[20:30] *** chferfa has quit IRC ()
[20:45] *** ndiddy has quit IRC (Ping timeout: 252 seconds)
[21:03] *** ppsym has joined #archiveteam-bs
[21:05] *** Flashfire has quit IRC (Ping timeout: 252 seconds)
[21:05] *** w0rmhole has quit IRC (Ping timeout: 252 seconds)
[21:05] *** kiskaBak has quit IRC (Ping timeout: 252 seconds)
[21:05] *** PurpleSym has quit IRC (Ping timeout: 252 seconds)
[21:05] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[21:05] *** Frogging has quit IRC (Ping timeout: 252 seconds)
[21:05] *** hook54321 has quit IRC (Ping timeout: 252 seconds)
[21:05] *** ppsym is now known as PurpleSym
[21:06] *** Flashfire has joined #archiveteam-bs
[21:06] *** medowar has
joined #archiveteam-bs
[21:06] *** i0npulse has joined #archiveteam-bs
[21:07] *** w0rmhole has joined #archiveteam-bs
[21:07] *** kiskaBak has joined #archiveteam-bs
[21:07] *** hook54321 has joined #archiveteam-bs
[21:08] *** Frogging has joined #archiveteam-bs
[21:54] *** BlueMax has joined #archiveteam-bs
[22:21] *** ndiddy has joined #archiveteam-bs
[22:48] *** ndiddy has quit IRC (Ping timeout: 255 seconds)
[22:53] *** schbirid has quit IRC (Remote host closed the connection)
[23:28] *** Sk1d has joined #archiveteam-bs