#archiveteam 2014-06-25,Wed

Time Nickname Message
10:32 🔗 Nemo_bis So much hate for bots: 10,580,498,315 Bad Bots Blocked http://www.distilnetworks.com/
10:55 🔗 midas Ia_archiver is blocked
10:56 🔗 midas or, could be blocked
10:56 🔗 midas but it isn't clear what they do and don't block
12:57 🔗 balrog midas: probably is
12:57 🔗 balrog Ia_archiver
12:57 🔗 balrog Ia_archiver is a web crawler for Alexa, an analytics and web information company.
12:57 🔗 balrog they consider it an alexa analytics crawler which it isn't
13:31 🔗 Nemo_bis they might be a decade or two out of date
13:42 🔗 midas not too shabby
14:06 🔗 balrog https://alexa.zendesk.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers
14:06 🔗 balrog that says nothing about IA
14:31 🔗 DFJustin http://archivebot.at.ninjawedding.org:4567/#/histories/http://cryptome.org/
14:31 🔗 DFJustin er
14:38 🔗 Nemo_bis 300 GB almost? :o http://archivebot.at.ninjawedding.org:4567/#/histories/http://wdl2.winworldpc.com/
14:39 🔗 DFJustin yeah I guess it won't show up on there until the job finishes properly
14:39 🔗 DFJustin which will need manual intervention from SketchCow
14:41 🔗 Nemo_bis lol what sense does this make http://archivebot.at.ninjawedding.org:4567/#/histories/https://wiki.archlinux.org/
14:42 🔗 Nemo_bis that wiki is hyper-easy to archive and has almost no custom extensions or anything
14:45 🔗 ivan` does the hyper-easy archiving thing put pages into wayback?
14:46 🔗 ivan` we've been grabbing a lot of wikis with archivebot just for that
14:50 🔗 Nemo_bis Sure, that's a benefit
14:51 🔗 Nemo_bis But what's the benefit of sending the poor archivebot in Special:WhatLinksHere rabbit holes :)
14:52 🔗 * ats imagines aggregating all the grabbed wikis into an enormous meta-wiki, so you can follow links between them easily...
14:52 🔗 Nemo_bis ats, that was the point of the interwikis when they were invented :)
14:52 🔗 Nemo_bis OTOH in this case they can't blame us, there isn't even a robots.txt AFAICS https://wiki.archlinux.org/robots.txt
14:53 🔗 ats yeah, but that requires the authors to be aware of stuff on other wikis...
14:53 🔗 Nemo_bis http://meatballwiki.org/wiki/InterWiki
14:54 🔗 Nemo_bis Well, not according to some, ats: "InterWikiSearch deals with the obvious problem of not knowing what is where."
14:54 🔗 Nemo_bis However that had only been implemented for the c2.com wikis and perhaps communitywiki IIRC
15:23 🔗 balrog why not make a bot like archivebot that instead uses DumpGenerator for wikis?
15:24 🔗 exmic hmmmm
15:25 🔗 exmic yes
15:26 🔗 DFJustin I would like that, although I checked the wikis on my to-archive list the other day and wikiteam already hit them; I think their coverage is pretty high at this point
15:36 🔗 Nemo_bis Current plan is to buy a list of MediaWikis from one of those web-crawlers
15:37 🔗 Nemo_bis But mostly we need to make our code more modern to catch all those wikis which we fail to download for silly urllib2 errors and the like
15:37 🔗 Nemo_bis When we've done that we could launch it all over Wikia with Warrior, haha
15:49 🔗 balrog Nemo_bis: I thought I got one of those bugs fixed
15:49 🔗 balrog because urllib2 wasn't being used right :P
15:55 🔗 Nemo_bis true but there's so many
15:56 🔗 balrog report them all
15:56 🔗 Nemo_bis they're surely all features, not bugs
15:56 🔗 Nemo_bis But still they kill download of about 2400 wikis we have, I expect
15:57 🔗 balrog they're probably bugs in the downloader :/
15:57 🔗 balrog 2400?
15:57 🔗 balrog yeah, that certainly has to be looked at
18:29 🔗 wp494 RIP appurify
18:29 🔗 wp494 just announced on google i/o
18:33 🔗 SketchCow Grabbing
18:40 🔗 wp494 looks like twitch lives on for another while
18:58 🔗 Jonimus If twitch dies, livestream or similar will hopefully take their place, though all that lost archived video will be sad
19:14 🔗 APerti Twitch has zillions of dollars from M$, no?
19:15 🔗 midas Jonimus: and all we could do is dance around the burning rubble
19:18 🔗 db48x nah, we could archive twitch
19:18 🔗 db48x we should be archiving twitch
19:24 🔗 midas all of twitch might be a tad harder, but sure
19:24 🔗 midas go ahead
