#archiveteam 2014-07-20,Sun

Time	Nickname	Message
00:11 ^🔗	BlueMax	https://torrentfreak.com/riaa-now-bullying-fully-licensed-zero-revenue-music-site-140719/ this is quite concerning
00:38 ^🔗	joepie91	BlueMax: why aren't these guys branded as mafia yet
00:39 ^🔗	BlueMax	yeah :|
05:23 ^🔗	bsmith093	well, i think pidgin is dead for mint.
05:24 ^🔗	bsmith093	i havent been able to get on here for a day... a DAY.. because it decided to die
05:24 ^🔗	bsmith093	any projects going on?
12:18 ^🔗	Dec-31-99	Hey there, folks. https://secure.avaaz.org/en/petition/The_Internet_Archive_Include_Every_Site_on_the_Wayback_Machine_Regardless_of_Robotstxt
12:21 ^🔗	Dec-31-99	Hello?
12:29 ^🔗	nitro2k01	There are two different issues here: crawling a site where robots.txt disallows it, and storing a site where robots.txt disallows it.
12:30 ^🔗	nitro2k01	Crawling a site in spite of robots.txt is rude and should be avoided. On the other hand, I'm seeing IA removing sites just because the domain name has expired and the new owner, mostly spam landing pages, disallow it.
12:30 ^🔗	nitro2k01	I would argue that the latter is the single biggest threat to historical information availability on IA.
12:30 ^🔗	nitro2k01	I've tried several times to contact them about it, but...
12:34 ^🔗	Dec-31-99	What happened?
12:34 ^🔗	nitro2k01	When I contacted them? Nothing, of course.
12:35 ^🔗	Dec-31-99	I worked very hard on explaining to archive.org where missing files on archive.org are supposed to go.
12:35 ^🔗	Dec-31-99	On the donkeykongcountry.com defunct site. No reply. Z_Z
12:36 ^🔗	Dec-31-99	I think someday the Archive Team should start their own Web Archive.
12:37 ^🔗	Dec-31-99	But why do they ignore our want for robots.txt to be demolished?
12:38 ^🔗	nitro2k01	Actually, as far as I know IA ignores robots.txt except when IA_Archiver is explicitly disallowed.
12:39 ^🔗	nitro2k01	But many domain nappers put those two fatal lines into robots.txt
12:39 ^🔗	nitro2k01	Disallow:
12:39 ^🔗	nitro2k01	User-agent: ia_archiver
12:39 ^🔗	Dec-31-99	Or when all robots are disallowed.
12:39 ^🔗	nitro2k01	And boom, the old content is gone from the archive as well.
12:39 ^🔗	Dec-31-99	Like:
12:39 ^🔗	Dec-31-99	User agent: *
12:39 ^🔗	Dec-31-99	Disallow: /
12:40 ^🔗	nitro2k01	No, I don't think it does. Maybe someone can confirm this.
12:40 ^🔗	Dec-31-99	From another user's link post on IA Forums: http://web.archive.org/web/20070103112847/http://www.infoceptor.com/
12:41 ^🔗	Dec-31-99	and... http://www.infoceptor.com/robots.txt
12:41 ^🔗	nitro2k01	Ok, you're right.
12:41 ^🔗	Dec-31-99	Robots.txt is a recipe for web annihilation!
12:42 ^🔗	Dec-31-99	Or when only specific directories are blocked from all web crawlers from accessing. Like: http://web.archive.org/*/google.com/search
12:42 ^🔗	Dec-31-99	But web.archive.org/*/google.com can be accessed
12:43 ^🔗	Dec-31-99	It's because their robots policy is written to exclude some directories and not all. This is common for many popular sites.
12:45 ^🔗	Dec-31-99	So how are we going to resolve this robots.txt problem?
12:45 ^🔗	Dec-31-99	Every few weeks I cross my fingers that archive.org destroys their robots.txt policy.
12:46 ^🔗	Dec-31-99	But then it doesn't happen!!! d'ohpalm
12:48 ^🔗	Dec-31-99	nitro2k01: How are we going to get rid of this issue? I wanted to access nintendo.co.uk's site, but it was excluded entirely per request by site owner...
12:48 ^🔗	Dec-31-99	It's a huge pain to see "Sorry. This url has been excluded from the Wayback Machine."
12:49 ^🔗	Dec-31-99	The webpage actually gives me a 403 Forbidden, rather than a 404 Not Found. I used Live HTTP Headers to find that out.
12:49 ^🔗	Dec-31-99	So it is hidden in their servers, but they won't show it to the public.
16:31 ^🔗	chazchaz	Some people are pretty confused.
16:33 ^🔗	yipdw	Dec-31-99 should start a website about The IA Conspiracy
16:46 ^🔗	xmc	Dec-31-99 should change their name to Dec-31-69
17:05 ^🔗	Nemo_bis	To confuse UNIX time?
17:06 ^🔗	xmc	yea
17:08 ^🔗	Nemo_bis	I think their current level of confusion is sufficient
17:28 ^🔗	nitro2k01	It's not a conspiracy, just badly implemented policy. (Deleting old content because of a new robots.txt.)
17:30 ^🔗	chazchaz	I don't think it's even deleted, it's just not public
17:30 ^🔗	chazchaz	or is that not true?
17:30 ^🔗	yipdw	it's not deleted
17:30 ^🔗	nitro2k01	Hopefully not deleted, yes. But that matters less for me since it's now inaccessible for all foreseeable future.
17:30 ^🔗	chazchaz	Not that there's a huge practical difference, though
17:31 ^🔗	yipdw	sufficiently bad policy is indistinguishable from conspiracy
17:36 ^🔗	Nemo_bis	hmm
17:37 ^🔗	Nemo_bis	So the existence of capital punishment in USA is a conspiracy?
17:37 ^🔗	yipdw	sufficiently dry humor is indistinguishable from literary reference
17:37 ^🔗	yipdw	also woop woop woop etc.
17:42 ^🔗	nitro2k01	Maybe we could organize a domain buyout as a protest. Buy the domain one by one from the domain nappers for a few hundred dollars or whatever they charge, then revert the robots.txt and hope that the IA Archiver picks it up and makes the archive available again.
17:42 ^🔗	nitro2k01	I'm joking of course, but if I had a domain that used to contain information I wanted very badly, I might've considered doing that.
17:43 ^🔗	Nemo_bis	Might be their business model
18:23 ^🔗	ersi	Dec-31-99 should.. stfu
18:23 ^🔗	ersi	;o
18:27 ^🔗	ersi	nitro2k01: It's not deleted, FYI. It's darked.
18:28 ^🔗	nitro2k01	Ok.
18:28 ^🔗	ersi	And if you want to keep talking about this bullshit, #archiveteam-bs is where you should go.
18:28 ^🔗	ersi	Or to #carebox
18:30 ^🔗	nitro2k01	It's about the availability of archived data, so it's not completely off topic. Then again, discussing it here won't make any difference whatsoever.
18:32 ^🔗	ersi	Exactly, so it's completely off-topic. Also, it makes me furiously mad that this stupid subject gets drawn up so much.
18:34 ^🔗	xmc	#internetarchive
18:34 ^🔗	xmc	I completely agree with your sentiment, ersi
18:35 ^🔗	nitro2k01	Do you think it's a stupid subject because it's off-topic to the channel, or a stupid subject in general?
18:38 ^🔗	xmc	it's off-topic to this channel, and discussed way out of proportion to how interesting it is
18:39 ^🔗	ersi	It's also a stupid subject
18:40 ^🔗	ersi	since it's not actually deleted (Ha-HA, you thought IA DELETE things?) but just hidden from people who get their panties in a bunch
18:43 ^🔗	nitro2k01	I'll just wait a few more years and see if they change their policies before they go bankrupt or have a datacenter fire. :p
18:43 ^🔗	xmc	who?
18:43 ^🔗	nitro2k01	IA.
18:44 ^🔗	xmc	why would you wish for either of those things to happen
18:44 ^🔗	nitro2k01	When did I say I did?
18:45 ^🔗	ersi	You kinda indicated you wanted/wished for it, by the way you said it
18:45 ^🔗	nitro2k01	No.
18:45 ^🔗	ersi	Try doing what IA does, in the United states of lawsuits.
18:46 ^🔗	ersi	It's not going to be all archiving. Or fun..
18:46 ^🔗	xmc	IA is coy about not actually deleting things on robots.txt exactly because it tends to deter lawsuit people
18:47 ^🔗	ersi	and that's also why it's not written down specifically
18:47 ^🔗	nitro2k01	One way to resolve it is to ignore robots.txt for historical content, only for domains that now belong to domain nappers.
18:48 ^🔗	nitro2k01	But ok, sure.
18:49 ^🔗	ersi	From what I've experienced, it doesn't help to even discuss these things. So let's not continue discussing this. Unless there's something you need help archiving, because it's not available on IA (which would be OK to talk about, even though it's still fucking archived, just that you can't watch it).
18:49 ^🔗	ersi	So let's talk about something that doesn't make me ban people, because that'd keep upsetting people.
18:50 ^🔗	xmc	I'd support a ban policy for people who push this issue in #archiveteam
18:50 ^🔗	ersi	Especially for repeat-offenders.
18:50 ^🔗	xmc	"archiveteam: we're not archive.org"
18:50 ^🔗	nitro2k01	/topic
18:50 ^🔗	ersi	That's true.
18:52 ^🔗	xmc	done
18:52 ^🔗	nitro2k01	So, to discuss something that is on-topic. I asked someone to archive Rocketboom's videos, because they were announcing the videos were going to be deleted.
18:52 ^🔗	ersi	damn you efnet
18:52 ^🔗	nitro2k01	If the nick limit wasn't bad enough...
18:52 ^🔗	xmc	try removing the thefacebook url
18:53 ^🔗	nitro2k01	What's the general process after something has been archived? Torrent?
18:53 ^🔗	ersi	I'll leave it be. It's (FB) popular amongst the kids and what not
18:53 ^🔗	xmc	nitro2k01: who did you ask to do it?
18:53 ^🔗	nitro2k01	Let me check my logs.
18:54 ^🔗	nitro2k01	midas:
18:54 ^🔗	zenguy_pc	anyone archiving reelradio or whatever service the riaa is targeting?
18:57 ^🔗	garyrh	zenguy_pc, there was an archivebot task for grabbing whatever wasn't behind a paywall, but i think it was aborted for some reason.
18:57 ^🔗	xmc	oh, istr it got stuck
18:58 ^🔗	garyrh	<yipdw> 814336k0tl443ilaam07k6u05 failed; reelradio.com can be requeued whenever
19:00 ^🔗	yipdw	yeah, it was on a DO node that I ran
19:00 ^🔗	yipdw	job got stuck on an empty reply, which is odd
19:32 ^🔗	Nemo_bis	s/lengthy\/off-topic/lengthy\/off-topic\/robots.txt/
19:35 ^🔗	godane	so sockington has a wikipedia page: http://en.wikipedia.org/wiki/Sockington
19:35 ^🔗	xmc	you mean s,lengthy/off-topic,lengthy/off-topic/robots.txt,
19:36 ^🔗	xmc	Nemo_bis: we can just respond with "that topic has been already deemed to be off-topic"
19:52 ^🔗	garyrh	perhaps some sort of note about this should be added to the wiki, as the petition Dec-31-99 started explicitly links it
19:53 ^🔗	garyrh	s/it/to it/
