#archiveteam 2014-07-20,Sun

Time Nickname Message
00:11 🔗 BlueMax https://torrentfreak.com/riaa-now-bullying-fully-licensed-zero-revenue-music-site-140719/ this is quite concerning
00:38 🔗 joepie91 BlueMax: why aren't these guys branded as mafia yet
00:39 🔗 BlueMax yeah :|
05:23 🔗 bsmith093 well, i think pidgin is dead for mint.
05:24 🔗 bsmith093 i havent been able to get on here for a day... a DAY.. because it decided to die
05:24 🔗 bsmith093 any projects going on?
12:18 🔗 Dec-31-99 Hey there, folks. https://secure.avaaz.org/en/petition/The_Internet_Archive_Include_Every_Site_on_the_Wayback_Machine_Regardless_of_Robotstxt
12:21 🔗 Dec-31-99 Hello?
12:29 🔗 nitro2k01 There are two different issues here: crawling a site where robots.txt disallows it, and storing a site where robots.txt disallows it.
12:30 🔗 nitro2k01 Crawling a site in spite of robots.txt is rude and should be avoided. On the other hand, I'm seeing IA removing sites just because the domain name has expired and the new owner, mostly spam landing pages, disallow it.
12:30 🔗 nitro2k01 I would argue that the latter is the single biggest threat to historical information availability on IA.
12:30 🔗 nitro2k01 I've tried several times to contact them about it, but...
12:34 🔗 Dec-31-99 What happened?
12:34 🔗 nitro2k01 When I contacted them? Nothing, of course.
12:35 🔗 Dec-31-99 I worked very hard on explaining to archive.org where missing files on archive.org are supposed to go.
12:35 🔗 Dec-31-99 On the donkeykongcountry.com defunct site. No reply. Z_Z
12:36 🔗 Dec-31-99 I think someday the Archive Team should start their own Web Archive.
12:37 🔗 Dec-31-99 But why do they ignore our want for robots.txt to be demolished?
12:38 🔗 nitro2k01 Actually, as far as I know IA ignores robots.txt except when IA_Archiver is explicitly disallowed.
12:39 🔗 nitro2k01 But many domain nappers put those two fatal lines into robots.txt
12:39 🔗 nitro2k01 Disallow:
12:39 🔗 nitro2k01 User-agent: ia_archiver
12:39 🔗 Dec-31-99 Or when all robots are disallowed.
12:39 🔗 nitro2k01 And boom, the old content is gone from the archive as well.
12:39 🔗 Dec-31-99 Like:
12:39 🔗 Dec-31-99 User-agent: *
12:39 🔗 Dec-31-99 Disallow: /
12:40 🔗 nitro2k01 No, I don't think it does. Maybe someone can confirm this.
12:40 🔗 Dec-31-99 From another user's link post on IA Forums: http://web.archive.org/web/20070103112847/http://www.infoceptor.com/
12:41 🔗 Dec-31-99 and... http://www.infoceptor.com/robots.txt
12:41 🔗 nitro2k01 Ok, you're right.
12:41 🔗 Dec-31-99 Robots.txt is a recipe for web annihilation!
12:42 🔗 Dec-31-99 Or when only specific directories are blocked from all web crawlers from accessing. Like: http://web.archive.org/*/google.com/search
12:42 🔗 Dec-31-99 But web.archive.org/*/google.com can be accessed
12:43 🔗 Dec-31-99 It's because their robots policy is written to exclude some directories and not all. This is common for many popular sites.
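The three robots.txt cases described above (blocking only ia_archiver, blocking every crawler, and blocking a single directory) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of how a standards-following parser reads such rules, not a model of how the Wayback Machine itself applies them. It assumes the lines quoted at 12:39 were meant as `User-agent: ia_archiver` followed by `Disallow: /`; the helper name `allowed`, the constant names, and the `example.com` URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Case 1: only the Internet Archive's crawler is disallowed (assumed form).
BLOCK_IA = "User-agent: ia_archiver\nDisallow: /\n"

# Case 2: every crawler is disallowed from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Case 3: every crawler is disallowed from one directory only.
BLOCK_DIR = "User-agent: *\nDisallow: /search\n"

if __name__ == "__main__":
    site = "http://example.com/"  # placeholder URL
    print(allowed(BLOCK_IA, "ia_archiver", site))
    print(allowed(BLOCK_IA, "Googlebot", site))
    print(allowed(BLOCK_ALL, "ia_archiver", site))
    print(allowed(BLOCK_DIR, "ia_archiver", site + "search"))
    print(allowed(BLOCK_DIR, "ia_archiver", site))
```

Under these rules, case 1 blocks only ia_archiver, case 2 blocks ia_archiver along with everything else, and case 3 blocks `/search` while leaving the site root fetchable, which matches the google.com example given in the log.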
12:45 🔗 Dec-31-99 So how are we going to resolve this robots.txt problem?
12:45 🔗 Dec-31-99 Every few weeks I cross my fingers that archive.org destroys their robots.txt policy.
12:46 🔗 Dec-31-99 But then it doesn't happen!!! *d'ohpalm*
12:48 🔗 Dec-31-99 nitro2k01: How are we going to get rid of this issue? I wanted to access nintendo.co.uk's site, but it was excluded entirely per request by site owner...
12:48 🔗 Dec-31-99 It's a huge pain to see "Sorry. This url has been excluded from the Wayback Machine."
12:49 🔗 Dec-31-99 The webpage actually gives me a 403 Forbidden, rather than a 404 Not Found. I used Live HTTP Headers to find that out.
12:49 🔗 Dec-31-99 So it is hidden in their servers, but they won't show it to the public.
16:31 🔗 chazchaz Some people are pretty confused.
16:33 🔗 yipdw Dec-31-99 should start a website about The IA Conspiracy
16:46 🔗 xmc Dec-31-99 should change their name to Dec-31-69
17:05 🔗 Nemo_bis To confuse UNIX time?
17:06 🔗 xmc yea
17:08 🔗 Nemo_bis I think their current level of confusion is sufficient
17:28 🔗 nitro2k01 It's not a conspiracy, just badly implemented policy. (Deleting old content because of a new robots.txt.)
17:30 🔗 chazchaz I don't think it's even deleted, it's just not public
17:30 🔗 chazchaz or is that not true?
17:30 🔗 yipdw it's not deleted
17:30 🔗 nitro2k01 Hopefully not deleted, yes. But that matters less for me since it's now inaccessible for all foreseeable future.
17:30 🔗 chazchaz Not that there's a huge practical difference, though
17:31 🔗 yipdw sufficiently bad policy is indistinguishable from conspiracy
17:36 🔗 Nemo_bis hmm
17:37 🔗 Nemo_bis So the existence of capital punishment in USA is a conspiracy?
17:37 🔗 yipdw sufficiently dry humor is indistinguishable from literary reference
17:37 🔗 yipdw also woop woop woop etc.
17:42 🔗 nitro2k01 Maybe we could organize a domain buyout as a protest. Buy the domain one by one from the domain nappers for a few hundred dollars or whatever they charge, then revert the robots.txt and hope that the IA Archiver picks it up and makes the archive available again.
17:42 🔗 nitro2k01 I'm joking of course, but if I had a domain that used to contain information I wanted very badly, I might've considered doing that.
17:43 🔗 Nemo_bis Might be their business model
18:23 🔗 ersi Dec-31-99 should.. stfu
18:23 🔗 ersi ;o
18:27 🔗 ersi nitro2k01: It's not deleted, FYI. It's darked.
18:28 🔗 nitro2k01 Ok.
18:28 🔗 ersi And if you want to keep talking about this bullshit, #archiveteam-bs is where you should go.
18:28 🔗 ersi Or to #carebox
18:30 🔗 nitro2k01 It's about the availability of archived data, so it's not completely off topic. Then again, discussing it here won't make any difference whatsoever.
18:32 🔗 ersi Exactly, so it's completely off-topic. Also, it makes me furiously mad that this stupid subject gets drawn up so much.
18:34 🔗 xmc #internetarchive
18:34 🔗 xmc I completely agree with your sentiment, ersi
18:35 🔗 nitro2k01 Do you think it's a stupid subject because it's off-topic to the channel, or a stupid subject in general?
18:38 🔗 xmc it's off-topic to *this* channel, and discussed way out of proportion to how interesting it is
18:39 🔗 ersi It's also a stupid subject
18:40 🔗 ersi since it's not actually deleted (Ha-HA, you thought IA DELETE things?) but just hidden from people who get their panties in a bunch
18:43 🔗 nitro2k01 I'll just wait a few more years and see if they change their policies before they go bankrupt or have a datacenter fire. :p
18:43 🔗 xmc who?
18:43 🔗 nitro2k01 IA.
18:44 🔗 xmc why would you wish for either of those things to happen
18:44 🔗 nitro2k01 When did I say I did?
18:45 🔗 ersi You kinda indicated you wanted/wished for it, by the way you said it
18:45 🔗 nitro2k01 No.
18:45 🔗 ersi Try doing what IA does, in the United states of lawsuits.
18:46 🔗 ersi It's not going to be all archiving. Or fun..
18:46 🔗 xmc IA is coy about not actually deleting things on robots.txt exactly because it tends to deter lawsuit people
18:47 🔗 ersi and that's also why it's not written down specifically
18:47 🔗 nitro2k01 One way to resolve it is to ignore robots.txt for historical content, only for domains that now belong to domain nappers.
18:48 🔗 nitro2k01 But ok, sure.
18:49 🔗 ersi For what I've experienced, it doesn't help to even discuss these things. So let's not continue discussing this. Unless there's something you need help archiving, because it's not available on IA (which would be OK to talk about, even since it's still fucking archived just that you can't watch it).
18:49 🔗 ersi So let's talk about something that doesn't make me ban people, because that'd keep upsetting people.
18:50 🔗 xmc I'd support a ban policy for people who push this issue in #archiveteam
18:50 🔗 ersi Especially for repeat-offenders.
18:50 🔗 xmc "archiveteam: we're not archive.org"
18:50 🔗 nitro2k01 /topic
18:50 🔗 ersi That's true.
18:52 🔗 xmc done
18:52 🔗 nitro2k01 So, to discuss something that is on-topic. I asked someone to archive Rocketboom's videos, because they were announcing the videos were going to be deleted.
18:52 🔗 ersi damn you efnet
18:52 🔗 nitro2k01 If the nick limit wasn't bad enough...
18:52 🔗 xmc try removing the thefacebook url
18:53 🔗 nitro2k01 What's the general process after something has been archived? Torrent?
18:53 🔗 ersi I'll leave it be. It's (FB) popular amongst the kids and what not
18:53 🔗 xmc nitro2k01: who did you ask to do it?
18:53 🔗 nitro2k01 Let me check my logs.
18:54 🔗 nitro2k01 midas:
18:54 🔗 zenguy_pc anyone archiving reelradio or whatever service the riaa is targeting?
18:57 🔗 garyrh zenguy_pc, there was an archivebot task for grabbing whatever *wasn't* behind a paywall, but i think it was aborted for some reason.
18:57 🔗 xmc oh, istr it got stuck
18:58 🔗 garyrh <yipdw> 814336k0tl443ilaam07k6u05 failed; reelradio.com can be requeued whenever
19:00 🔗 yipdw yeah, it was on a DO node that I ran
19:00 🔗 yipdw job got stuck on an empty reply, which is odd
19:32 🔗 Nemo_bis s/lengthy\/off-topic/lengthy\/off-topic\/robots.txt/
19:35 🔗 godane so sockington has a wikipedia page: http://en.wikipedia.org/wiki/Sockington
19:35 🔗 xmc you mean s,lengthy/off-topic,lengthy/off-topic/robots.txt,
19:36 🔗 xmc Nemo_bis: we can just respond with "that topic has been already deemed to be off-topic"
19:52 🔗 garyrh perhaps some sort of note about this should be added to the wiki, as the petition Dec-31-99 started explicitly links it
19:53 🔗 garyrh s/it/to it/
