[00:02] *** SN4T14 has quit IRC (Quit: ZNC 1.6.3 - http://znc.in)
[00:03] *** SN4T14 has joined #archiveteam-bs
[00:11] *** nightpool has quit IRC (Read error: Operation timed out)
[00:13] *** nightpool has joined #archiveteam-bs
[01:01] *** TheLovina has quit IRC (Read error: Operation timed out)
[01:03] *** TheLovina has joined #archiveteam-bs
[01:16] !a http://82.221.129.208/ --useragent firefox
[01:28] *** username1 has joined #archiveteam-bs
[01:31] *** schbirid2 has quit IRC (Read error: Operation timed out)
[01:51] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[02:41] *** Odd0002 has quit IRC (Remote host closed the connection)
[03:00] So I just saw this: https://github.com/chfoo/wpull/issues/356
[03:00] Would it be possible to incentivize sites to not disallow ia_archiver in their robots.txt file by respecting delay specified in robots.txt?
[03:01] We don't negotiate with terrorists
[03:01] lol.
[03:02] :p
[03:06] but like if we were going to do that as the issue suggests, i don't see why we would want to cooperate with people that disallow the wayback machine.
[03:07] i think that it's stupid that some sites try to tell people to use a crawl delay of 10 seconds though
[03:27] Brendan Eich appears to be supporting this: https://github.com/EdOverflow/security-txt
[03:32] *** qw3rty119 has joined #archiveteam-bs
[03:38] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[03:51] *** Stilett0 has joined #archiveteam-bs
[04:09] Wiki is acting kinda funny
[04:24] JAA: Daily Stormer is moving to the TOR Network
[04:26] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:33] *** Sk1d has joined #archiveteam-bs
[04:37] Apparently Google froze their domain, so they can't move it now.
[04:46] *** robink has quit IRC (Read error: Connection reset by peer)
[04:51] *** robink has joined #archiveteam-bs
[04:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:01] *** dashcloud has joined #archiveteam-bs
[05:05] *** kimmer22 has joined #archiveteam-bs
[05:14] *** kimmer2 has quit IRC (Ping timeout: 633 seconds)
[05:20] *** Stilett0 is now known as Stiletto
[05:25] *** kimmer2 has joined #archiveteam-bs
[05:33] *** kimmer22 has quit IRC (Ping timeout: 633 seconds)
[05:53] hook54321: Something that might be more fruitful is checking what the support for HTTP error 429 is in wpull. I've seen logs where we get a lot of 429s followed by a 200 followed by a lot of 429s again. RFC6585. Either:
[05:53] 1) wpull does not handle the Retry-After header
[05:53] 2) The site is still not prepared to answer requests after timeout
[05:53] 3) The site does not send a Retry-After header
[05:53] If it's 2 or 3, then there's not much we can do; if it's 1, we would probably save all sides trouble by implementing it, and minimize chances to get IP-banned. Then add a pipeline override if there is reason to ignore requests from the server to back off.
[05:53] *** HCross has quit IRC (Read error: Connection reset by peer)
[05:54] *** HCross has joined #archiveteam-bs
[05:55] *** robogoat has quit IRC (Read error: Operation timed out)
[05:56] *** robogoat has joined #archiveteam-bs
[06:19] *** kimmer22 has joined #archiveteam-bs
[06:19] *** godane has quit IRC (Quit: Leaving.)
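(Editor's sketch for the 05:53 messages about 429 and Retry-After. This is not wpull code and says nothing about what wpull currently does. It only shows how a client might turn a Retry-After value, which HTTP allows to be either a number of seconds or an HTTP date, into a wait time. The function name, the 60-second fallback for case 3 (no header sent) and the one-hour cap are assumptions made for illustration.)

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=60.0, cap=3600.0):
    """Return how many seconds to wait after a 429 response.

    header_value is the raw Retry-After value, or None if the server
    did not send one. default and cap are illustrative assumptions.
    """
    if header_value is None:
        return default  # case 3 in the log: no header, fall back to a guess
    value = header_value.strip()
    if value.isdigit():
        delay = float(value)  # delta-seconds form, e.g. "120"
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return default  # unparseable header, treat like a missing one
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return max(0.0, min(delay, cap))
```

A pipeline override of the kind suggested at 05:53 could then be as simple as ignoring this value when an operator has decided the back-off request should not be honoured.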
[06:26] *** kimmer2 has quit IRC (Ping timeout: 633 seconds)
[06:49] *** j08nY has joined #archiveteam-bs
[06:59] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:14] *** BlueMaxim has joined #archiveteam-bs
[07:15] *** kimmer2 has joined #archiveteam-bs
[07:15] *** TheLovina has quit IRC (Ping timeout: 370 seconds)
[07:15] *** TheLovina has joined #archiveteam-bs
[07:20] *** kimmer22 has quit IRC (Ping timeout: 633 seconds)
[07:20] *** dashcloud has joined #archiveteam-bs
[07:28] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[07:41] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[07:41] *** BlueMaxim has joined #archiveteam-bs
[07:48] *** Honno has joined #archiveteam-bs
[07:49] *** HCross has quit IRC (Remote host closed the connection)
[07:49] *** HCross has joined #archiveteam-bs
[08:32] *** j08nY has quit IRC (Read error: Operation timed out)
[08:34] *** kimmer22 has joined #archiveteam-bs
[08:38] *** kimmer2 has quit IRC (Ping timeout: 633 seconds)
[08:40] *** kimmer2 has joined #archiveteam-bs
[08:40] *** Boppen has joined #archiveteam-bs
[08:45] *** kimmer22 has quit IRC (Ping timeout: 633 seconds)
[08:45] *** kimmer22 has joined #archiveteam-bs
[08:50] *** kimmer2 has quit IRC (Ping timeout: 632 seconds)
[08:51] JAA: Onion address for Daily Stormer: http://dstormer6em3i4km.onion/
[08:51] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:25] *** kimmer2 has joined #archiveteam-bs
[09:30] *** kimmer22 has quit IRC (Ping timeout: 633 seconds)
[09:32] *** kimmer1 has joined #archiveteam-bs
[09:36] *** godane has joined #archiveteam-bs
[09:37] looks like IA is down again
[09:48] yup
[09:49] nothing on their twitter yet.
[10:19] *** Honno has quit IRC (Read error: Operation timed out)
[10:30] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[10:30] *** Mateon1 has joined #archiveteam-bs
[10:48] *** j08nY has joined #archiveteam-bs
[10:56] *** ivan has quit IRC (Leaving)
[11:18] *** marvinw has joined #archiveteam-bs
[11:21] Very interesting court decision: https://www.reuters.com/article/us-microsoft-linkedin-ruling-idUSKCN1AU2BV
[11:44] *** atluxity1 has joined #archiveteam-bs
[11:46] *** atluxity has quit IRC (Ping timeout: 506 seconds)
[11:50] We should start archiving whois information.
[11:50] And DNS records
[12:43] holy shit
[12:43] that is actually a Very Big Deal
[13:17] *** s2e has joined #archiveteam-bs
[13:27] Is there guidance on how to best submit dozens of websites to the internet archive in a way that is respectful of their infrastructure? I work in the internet freedom sector focusing on educational content and many of the resources that get created disappear in months or a few years. I currently use a simple script to spider and submit new ones to the archive. I'd like to do this in a more automated fashion.
[13:27] But, I want to make sure I am doing it as respectfully as possible.
[13:29] to IA's infrastructure?
[13:29] *** j08nY has quit IRC (Read error: Operation timed out)
[13:29] I mean, respectful of IA's infrastructure?
[13:29] you probably want archivebot
[13:29] Yeah, if possible. I've seen other efforts try to archive separately, but they are largely unavailable to others
[13:30] join #archivebot, check out how it works, submit a website with !a, watch the dashboard, it'll get absorbed into wayback
[13:30] awesome
[13:32] joepie91: eli5?
[13:33] Since archivebot is a volunteer service, is the method it uses the best method for doing this without a drain on others' resources? Is it something I could run on my own to do the archiving and supply the WARC files in the same way?
[13:33] Frogging: my understanding is - it is legal to scrape public personal information on websites for commercial purposes
[13:34] s2e: you could provide a pipeline, but I'm not sure if we're accepting right now; or you can run something like grab-site yourself, but you'd have to find some avenue to get the warcs into wayback.
[13:35] Frogging: not only it is legal, you cannot put measures in place against it
[13:35] I see.
[13:35] IANAL
[13:35] Sanqui: Thanks! I'll start with archivebot and bother IA about WARC inclusion.
[13:35] the applications they mentioned on the page don't instill confidence
[13:36] using "publicly available data and artificial intelligence to help companies identify potential customers"
[13:36] building "algorithms capable of predicting employee behaviors, such as when they might quit"
[13:37] "If LinkedIn is going to allow profiles to be indexed by search engines to benefit their platform then why shouldn't the rest of the internet benefit from that as well?"
[13:40] *** Mateon1 has quit IRC (Remote host closed the connection)
[13:40] *** kimmer22 has joined #archiveteam-bs
[13:41] *** Mateon1 has joined #archiveteam-bs
[13:43] *** s2e has left WeeChat 1.6
[13:47] *** kimmer2 has quit IRC (Ping timeout: 633 seconds)
[14:15] *** j08nY has joined #archiveteam-bs
[15:01] *** pizzaiolo has joined #archiveteam-bs
[16:04] *** wabu has quit IRC (Read error: Operation timed out)
[16:09] *** kimmer2 has joined #archiveteam-bs
[16:13] *** username1 is now known as schbirid
[16:14] *** wabu has joined #archiveteam-bs
[16:17] *** kimmer22 has quit IRC (Ping timeout: 633 seconds)
[17:07] *** pizzaiolo has quit IRC (pizzaiolo)
[17:08] JAA, joepie91: i was talking with FalconK the other day, and he mentioned the idea of running a recursive resolver that archives all results, and having archivebot and the warrior use it as their default resolvers
[17:08] i really like this idea
[17:09] i'm not sure what the proper archival format for DNS would be
[17:09] I suppose you could cram it into a warc
[17:12] i thought warc is http
[17:12] *think
[17:12] It is not limited to HTTP, there’s a generic “resource” record.
[17:13] oh nice
[17:16] this looks like a torrent of the IA 911 videos: http://torrentproject.se/2d64409b6f179bc999159284156b3534711447a1/
[17:16] Also, DNS perfectly fits into the request/response scheme WARC is using for HTTP.
[17:19] That's a nice idea, apart from the fact that it introduces a single point of failure. If the resolver is down, *everything* crashes and burns.
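(Editor's sketch for the 17:09 to 17:16 messages about putting DNS results into a WARC. It hand-builds a single WARC/1.0 record using only the standard library, with the generic "resource" type mentioned at 17:12. The function name, the `dns:` target URI, the `text/dns` content type and the zone-file-style block layout are assumptions chosen for illustration; this is not a description of what any existing crawler or resolver writes.)

```python
import uuid
from datetime import datetime, timezone


def dns_warc_record(hostname, answers, when=None):
    """Serialize DNS answers for hostname as one WARC/1.0 record (bytes).

    answers is a list of (ttl, record_type, value) tuples, for example
    (300, "A", "93.184.216.34"). when should be a timezone-aware
    datetime; it defaults to the current UTC time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Record block: one zone-file-style line per answer (assumed layout).
    block = "".join(
        f"{hostname}.\t{ttl}\tIN\t{rtype}\t{value}\n"
        for ttl, rtype, value in answers
    ).encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: dns:{hostname}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/dns",
        f"Content-Length: {len(block)}",
    ]
    # Headers end with a blank line; the record ends with two CRLFs.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"
```

Appending the returned bytes to a file for each lookup would give a plain uncompressed WARC. Capturing the query as well would need a separate request-type record, which this sketch leaves out.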
[17:22] yes, also that
[17:33] xmc: schbirid: heritrix stores DNS records in WARCs
[17:33] or well, DNS requests and responses
[17:33] hmmmmm
[17:36] *** kristian_ has joined #archiveteam-bs
[18:14] *** kristian_ has quit IRC (Quit: Leaving)
[18:23] so my birthday is tomorrow
[18:36] happy birthday godane (I would forget to say this tomorrow :p)
[18:53] *** fie_ has quit IRC (Ping timeout: 246 seconds)
[19:11] *** fie has joined #archiveteam-bs
[19:26] godane: happy birthday
[19:44] *** kimmer2 has quit IRC (Ping timeout: 633 seconds)
[20:16] *** kimmer1 has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
[20:56] Anyone know if there's something like this for Firefox? https://github.com/kissarat/never-lose
[21:08] *** bwn has quit IRC (Ping timeout: 268 seconds)
[21:13] *** bwn has joined #archiveteam-bs
[21:56] *** Honno has joined #archiveteam-bs
[22:03] it's 00:03 here now, happy birthday godane :D
[22:16] *** DFJustin has quit IRC (Read error: Connection reset by peer)
[22:17] *** DFJustin has joined #archiveteam-bs
[22:18] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:18] *** dashcloud has joined #archiveteam-bs
[22:23] *** pikhq has quit IRC (Read error: Operation timed out)
[22:23] that repo's list of porn sites seems to have a disproportionate amount of gay porn
[22:24] and random tumblrs. interesting. I wonder where they got it from
[22:38] *** Igloo has quit IRC (Read error: Operation timed out)
[22:38] *** j08nY has quit IRC (Read error: Operation timed out)
[22:42] *** pikhq has joined #archiveteam-bs
[22:43] *** godane has quit IRC (Ping timeout: 250 seconds)
[22:43] *** Jonimus has quit IRC (Ping timeout: 268 seconds)
[22:45] *** j08nY has joined #archiveteam-bs
[22:47] *** godane has joined #archiveteam-bs
[22:47] *** Igloo has joined #archiveteam-bs
[22:56] * hook54321 shrugs
[23:08] *** qw3rty111 has joined #archiveteam-bs
[23:10] *** Jonimus has joined #archiveteam-bs
[23:10] *** swebb sets mode: +o Jonimus
[23:11] *** qw3rty112 has joined #archiveteam-bs
[23:11] *** qw3rty119 has quit IRC (Ping timeout: 600 seconds)
[23:18] *** qw3rty111 has quit IRC (Read error: Operation timed out)
[23:30] *** j08nY has quit IRC (Quit: Leaving)