[00:00] *** yuitimoth has quit IRC (Ping timeout: 260 seconds)
[00:07] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[00:08] *** pfallenop has joined #archiveteam
[00:13] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[00:17] *** yuitimoth has joined #archiveteam
[00:19] *** pfallenop has joined #archiveteam
[00:25] *** dashcloud has joined #archiveteam
[00:25] *** Okgo has joined #archiveteam
[00:27] Ffx
[00:27] Ffx
[00:30] *** BlueMaxim has joined #archiveteam
[00:32] *** Okgo has quit IRC (Ping timeout: 268 seconds)
[00:42] *** zout has quit IRC (Ping timeout: 244 seconds)
[00:45] *** zout has joined #archiveteam
[01:03] *** zout has quit IRC (Ping timeout: 244 seconds)
[01:07] *** zout has joined #archiveteam
[01:11] *** zout has quit IRC (Ping timeout: 244 seconds)
[01:25] *** zout has joined #archiveteam
[01:42] *** zout has quit IRC (Ping timeout: 244 seconds)
[01:46] *** zout has joined #archiveteam
[03:49] *** maelstrom has quit IRC (Quit: Leaving)
[03:52] *** wp494 has quit IRC (Read error: Connection reset by peer)
[04:04] *** ndiddy has quit IRC (Ping timeout: 244 seconds)
[04:12] *** mutoso_ has quit IRC (Read error: Connection reset by peer)
[04:17] *** mutoso has joined #archiveteam
[04:43] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:49] *** Sk1d has joined #archiveteam
[06:32] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[06:50] *** wp494 has joined #archiveteam
[06:54] *** fie has joined #archiveteam
[07:12] *** atomotic has joined #archiveteam
[07:30] *** ravetcofx has quit IRC (Read error: Operation timed out)
[07:47] does anybody have any opinions on doing a capture of hackforums.net?
[07:47] scum-of-the-internet types, but they block almost all archival services like archive.org.
[07:48] they're frequently referenced by journalists like krebsonsecurity.com, but the complete lack of archival means all the content only lives on in screenshots.
[07:48] one forum I go to has auto-bans set up for hackforums.net referrers, it's pretty funny
[07:49] the very index requires a captcha though
[07:49] so not archivebottable
[07:49] hackforums itself will ban any IP address that uses wget or that is listed in MaxMind minFraud as a VPN/VPS.
[07:49] index? you can crawl with a logged in account without hitting recaptcha, so long as you heavily spoof the headers and the UA.
[07:50] archivebot will almost certainly be IP address blacklisted.
[07:50] nah, archivebot is run privately, they won't have the ips
[07:50] (no relation to ia)
[07:51] is its range anything that could be a rented server?
[07:51] depends on the pipeline
[07:51] oh hold up, I just had a fantastic idea.
[07:51] wait, no, it's already behind cloudflare.
[07:52] yeah cloudflare is poopoo atm
[07:52] drat. often you can abuse cloudflare to reverse proxy websites for you for scrapes, I don't know if that's public knowledge.
[07:53] you mean revealing the ip?
[07:53] ie, sign up for CF yourself and set the origin as the site you want to archive.
[07:53] oh wow, that's hilarious
[07:53] CF has lots of IP addresses they use to make the actual request to the origin server.
[07:54] sadly, if they're already behind CF then that doesn't work.
[07:56] tempted to just attempt to pull hackforums.net myself on my residential connection (which isn't banned). there's an awful lot of content though.
[08:04] be better if I could find a VPS or proxy that isn't banned though.
[08:06] I think it's using MaxMind GeoIP2, so it might just be a matter of finding a very new IP range that's not in the DB yet.
[08:42] :\
[08:42] found an IP address that isn't banned! pity it throws clownflare CAPTCHAs, I can't scrape millions of URLs like that :(
[08:44] 7.2M total pages in the queue, actually.
[08:50] ok, IPv6 gateways aren't banned. excellent.
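For a sense of scale, the 7.2M-page queue mentioned above can be turned into a rough crawl time. This is a back-of-the-envelope sketch only: the log gives the queue size but no request rate, so the rates below are hypothetical, and the function name `crawl_duration` is made up for illustration.

```python
from datetime import timedelta

QUEUE_SIZE = 7_200_000  # pages in the queue, as stated in the log


def crawl_duration(pages: int, requests_per_second: float) -> timedelta:
    """Return how long a crawl of `pages` takes at a steady request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return timedelta(seconds=pages / requests_per_second)


if __name__ == "__main__":
    # Hypothetical rates; the log does not say how fast the crawl could run.
    for rate in (0.5, 1.0, 5.0):
        days = crawl_duration(QUEUE_SIZE, rate).total_seconds() / 86_400
        print(f"{rate:>4} req/s -> {days:.1f} days")
```

At one request per second the queue works out to roughly 83 days of continuous crawling, which is why a per-page CAPTCHA makes a crawl of this size impractical.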
[09:41] *** kurt has joined #archiveteam
[11:10] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[11:44] *** Morbus has joined #archiveteam
[11:47] *** kyounko has quit IRC (Read error: Operation timed out)
[12:04] *** atomotic has joined #archiveteam
[12:08] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:51] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[14:34] *** Start has quit IRC (Quit: Disconnected.)
[14:34] *** Start has joined #archiveteam
[14:35] *** Start has quit IRC (Client Quit)
[14:45] *** achip has joined #archiveteam
[15:29] *** kurt has quit IRC (Remote host closed the connection)
[15:33] *** maelstrom has joined #archiveteam
[15:38] *** bRick5772 has joined #archiveteam
[16:03] *** ZeoNet has joined #archiveteam
[16:15] *** VADemon has joined #archiveteam
[16:20] *** ZeoNet_ has joined #archiveteam
[16:22] *** ZeoNet has quit IRC (Ping timeout: 370 seconds)
[16:31] *** ZeoNet has joined #archiveteam
[16:32] *** ZeoNet_ has quit IRC (Ping timeout: 370 seconds)
[16:35] *** AlexLehm has joined #archiveteam
[16:49] zout: you're archiving into WARCs right?
[17:04] *** ZeoNet_ has joined #archiveteam
[17:07] *** ZeoNet has quit IRC (Read error: Operation timed out)
[17:15] *** nwf has quit IRC (WeeChat 1.5)
[17:17] *** nwf has joined #archiveteam
[17:24] *** Swizzle has quit IRC (Quit: Leaving)
[17:39] *** ZeoNet_ has quit IRC (Read error: Operation timed out)
[18:03] *** VADemon has quit IRC (Read error: Operation timed out)
[18:04] *** vOYtEC has quit IRC (Read error: Connection reset by peer)
[18:04] *** vOYtEC has joined #archiveteam
[18:32] *** ravetcofx has joined #archiveteam
[19:11] *** bRick5772 has quit IRC (Quit: Leaving.)
[19:57] *** SketchCow has joined #archiveteam
[19:57] *** swebb sets mode: +o SketchCow
[20:01] arkiver: not yet, but when I am, yes.
[20:08] Hiiiii
[20:14] hi
[20:16] *** jessew-el has joined #archiveteam
[20:20] How well does Archivebot handle Storify?
I didn't see any plans on the wiki yet. I don't have any urgent reason to think it's dying right now, but it has a lot of valuable material.
[20:21] try it and see
[20:22] you can try loading a storify page without javascript.
[20:25] *** jessew-el has quit IRC (Ping timeout: 268 seconds)
[20:32] *** Stiletto has quit IRC ()
[20:35] answer is, it loads fine.
[20:35] the pagination is broken in a browser but should be fine with some hackery.
[20:38] "some hackery" isn't really possible with archivebot
[20:38] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[20:40] *** acridAxid has quit IRC (Quit: marauder)
[20:41] *** acridAxid has joined #archiveteam
[20:42] Shomi (a Netflix competitor in Canada) announced they're going to take themselves out to the back of the shed on November 30th
[20:42] http://shomimedia.com/news-feed/press-releases/2016/09/cue-the-closing-credits-shomi-thanks-canadians-for-a-good-run
[20:42] given its tv streaming nature there probably wouldn't be a whole lot we can get our hands on, but would be handy to toss social accounts into archivebot
[20:43] and maybe the main site itself
[20:51] *** Stiletto has joined #archiveteam
[21:03] *** ndiddy has joined #archiveteam
[21:05] *** computerf has quit IRC (Read error: Operation timed out)
[21:08] *** computerf has joined #archiveteam
[21:08] *** RichardG has joined #archiveteam
[21:11] I've found that crawls really get lengthy if I don't bother making custom whitelists for things.
[21:11] the difference between something targeted and something that sits and gathers a thousand variations of the same page with wget is well worth it.
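The whitelist idea in the last two messages can be sketched as a small URL filter. This is only an illustration, not the tooling anyone in the channel used: the host `forum.example.org`, the path patterns, and the names `is_whitelisted` and `filter_queue` are all hypothetical, and a real crawl would need patterns matched to the target site.

```python
import re
from urllib.parse import urlsplit

# Hypothetical whitelist: one host, and only index, forum, and thread pages.
ALLOWED_HOST = "forum.example.org"
ALLOWED_PATHS = [
    re.compile(r"^/?$"),
    re.compile(r"^/forum-\d+(-page-\d+)?\.html$"),
    re.compile(r"^/thread-\d+(-page-\d+)?\.html$"),
]


def is_whitelisted(url: str) -> bool:
    """Return True if the URL is on the allowed host and matches a path pattern."""
    parts = urlsplit(url)
    if parts.hostname != ALLOWED_HOST:
        return False
    if parts.query:
        # Query strings (sort orders, session IDs, print views) are a common
        # source of many variations of the same page, so this sketch drops them.
        return False
    return any(pattern.match(parts.path) for pattern in ALLOWED_PATHS)


def filter_queue(urls):
    """Keep whitelisted URLs in their original order, dropping exact repeats."""
    seen = set()
    kept = []
    for url in urls:
        if is_whitelisted(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    queue = [
        "https://forum.example.org/thread-12.html",
        "https://forum.example.org/thread-12.html?sort=asc",
        "https://forum.example.org/thread-12.html",
        "https://forum.example.org/member.php",
    ]
    for url in filter_queue(queue):
        print(url)
```

Rejecting every query string is a blunt choice made to keep the sketch short; a site that paginates through query parameters would need those parameters whitelisted explicitly.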
[21:24] *** computerf has quit IRC (Read error: Operation timed out)
[21:35] *** computerf has joined #archiveteam
[21:41] *** ndiddy has quit IRC (Quit: Leaving)
[21:51] *** yeoldetoa has quit IRC (Remote host closed the connection)
[21:52] *** kristian_ has joined #archiveteam
[22:15] *** pfallenop has quit IRC (Quit: Lost terminal)
[22:22] *** BlueMaxim has joined #archiveteam
[22:31] *** AlexLehm has quit IRC (Ping timeout: 244 seconds)
[22:32] *** pfallenop has joined #archiveteam
[22:33] *** RoanKatto has joined #archiveteam
[22:33] *** Start has joined #archiveteam
[22:34] *** Morbus has quit IRC (Read error: Operation timed out)
[22:35] *** achip has quit IRC (Read error: Operation timed out)
[23:49] *** vOYtEC has quit IRC (Ping timeout: 244 seconds)
[23:52] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)