[10:32] So much hate for bots: 10,580,498,315 Bad Bots Blocked http://www.distilnetworks.com/
[10:55] Ia_archiver is blocked
[10:56] or, could be blocked
[10:56] but it isn't clear what they do and don't block
[12:57] midas: probably is
[12:57] Ia_archiver
[12:57] Ia_archiver is a web crawler for Alexa, an analytics and web information company.
[12:57] they consider it an Alexa analytics crawler, which it isn't
[13:31] they might be a decade or two out of date
[13:42] not too shabby
[14:06] https://alexa.zendesk.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers
[14:06] that says nothing about IA
[14:31] http://archivebot.at.ninjawedding.org:4567/#/histories/http://cryptome.org/
[14:31] er
[14:38] almost 300 GB? :o http://archivebot.at.ninjawedding.org:4567/#/histories/http://wdl2.winworldpc.com/
[14:39] yeah, I guess it won't show up on there until the job finishes properly
[14:39] which will need manual intervention from SketchCow
[14:41] lol, what sense does this make? http://archivebot.at.ninjawedding.org:4567/#/histories/https://wiki.archlinux.org/
[14:42] that wiki is hyper-easy to archive and has almost no custom extensions or anything
[14:45] does the hyper-easy archiving thing put pages into Wayback?
[14:46] we've been grabbing a lot of wikis with archivebot just for that
[14:50] Sure, that's a benefit
[14:51] But what's the benefit of sending the poor archivebot down Special:WhatLinksHere rabbit holes :)
[14:52] * ats imagines aggregating all the grabbed wikis into an enormous meta-wiki, so you can follow links between them easily...
[14:52] ats, that was the point of interwikis when they were invented :)
[14:52] OTOH in this case they can't blame us; there isn't even a robots.txt, AFAICS: https://wiki.archlinux.org/robots.txt
[14:53] yeah, but that requires the authors to be aware of stuff on other wikis...
[14:53] http://meatballwiki.org/wiki/InterWiki
[14:54] Well, not according to some, ats: "InterWikiSearch deals with the obvious problem of not knowing what is where."
[14:54] However, that had only been implemented for the c2.com wikis and perhaps communitywiki, IIRC
[15:23] why not make a bot like archivebot that instead uses DumpGenerator for wikis?
[15:24] hmmmm
[15:25] yes
[15:26] I would like that, although I checked the wikis on my to-archive list the other day and WikiTeam had already hit them; I think their coverage is pretty high at this point
[15:36] Current plan is to buy a list of MediaWikis from one of those web crawlers
[15:37] But mostly we need to make our code more modern to catch all those wikis which we fail to download because of silly urllib2 errors and the like
[15:37] When we've done that, we could launch it all over Wikia with Warrior, haha
[15:49] Nemo_bis: I thought I got one of those bugs fixed
[15:49] because urllib2 wasn't being used right :P
[15:55] true, but there are so many
[15:56] report them all
[15:56] they're surely all features, not bugs
[15:56] But still they kill the download of about 2400 wikis we have, I expect
[15:57] they're probably bugs in the downloader :/
[15:57] 2400?
[15:57] yeah, that certainly has to be looked at
[18:29] RIP Appurify
[18:29] just announced at Google I/O
[18:33] Grabbing
[18:40] looks like Twitch lives on for another while
[18:58] If Twitch dies, Livestream or similar will hopefully take their place, though all that lost archived video will be sad
[19:14] Twitch has zillions of dollars from M$, no?
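(Aside, not from the channel: the "silly urllib2 errors" discussed above are typically transient HTTPError/URLError exceptions that abort an entire wiki dump. A minimal retry wrapper along these lines is the usual fix; this is only a sketch assuming Python 2-era urllib2, and the fetch_url helper is hypothetical, not WikiTeam's actual dumpgenerator code.)

    # Sketch only, assuming Python 2's urllib2 as discussed above; fetch_url is a
    # hypothetical helper, not part of WikiTeam's dumpgenerator.
    import time
    import urllib2

    def fetch_url(url, retries=5, delay=10, user_agent='wikiteam-sketch/0.1'):
        """Fetch a URL, retrying on transient errors instead of aborting the dump."""
        request = urllib2.Request(url, headers={'User-Agent': user_agent})
        for attempt in range(1, retries + 1):
            try:
                return urllib2.urlopen(request, timeout=60).read()
            except (urllib2.HTTPError, urllib2.URLError):
                # Back off and retry, so one hiccup doesn't kill a ~2400-wiki run.
                if attempt == retries:
                    raise
                time.sleep(delay * attempt)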
[19:15] Jonimus: and all we could do is dance around the burning rubble
[19:18] nah, we could archive Twitch
[19:18] we should be archiving Twitch
[19:24] all of Twitch might be a tad harder, but sure
[19:24] go ahead
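(Aside, not from the channel: one per-video starting point for "archiving Twitch" would be youtube-dl's Python API, which had a twitch.tv extractor; the channel and VOD URL below are placeholders, and nothing suggests this is what anyone in the log actually ran.)

    # Sketch only: grab a single Twitch VOD via youtube-dl; the URL is a placeholder.
    import youtube_dl

    options = {
        'outtmpl': '%(uploader)s/%(id)s-%(title)s.%(ext)s',  # group downloads by channel
        'ignoreerrors': True,  # one broken VOD shouldn't stop a batch
    }
    youtube_dl.YoutubeDL(options).download(['http://www.twitch.tv/example_channel/v/123456'])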