Time |
Nickname |
Message |
00:11
🔗
|
BlueMax |
https://torrentfreak.com/riaa-now-bullying-fully-licensed-zero-revenue-music-site-140719/ this is quite concerning |
00:38
🔗
|
joepie91 |
BlueMax: why aren't these guys branded as mafia yet |
00:39
🔗
|
BlueMax |
yeah :| |
05:23
🔗
|
bsmith093 |
well, i think pidgin is dead for mint. |
05:24
🔗
|
bsmith093 |
i havent been able to get on here for a day... a DAY.. because it decided to die |
05:24
🔗
|
bsmith093 |
any projects going on? |
12:18
🔗
|
Dec-31-99 |
Hey there, folks. https://secure.avaaz.org/en/petition/The_Internet_Archive_Include_Every_Site_on_the_Wayback_Machine_Regardless_of_Robotstxt |
12:21
🔗
|
Dec-31-99 |
Hello? |
12:29
🔗
|
nitro2k01 |
There are two different issues here: crawling a site where robots.txt disallows it, and storing a site where robots.txt disallows it. |
12:30
🔗
|
nitro2k01 |
Crawling a site in spite of robots.txt is rude and should be avoided. On the other hand, I'm seeing IA remove sites just because the domain name has expired and the new owner, usually a spam landing page, disallows it. |
12:30
🔗
|
nitro2k01 |
I would argue that the latter is the single biggest threat to historical information availability on IA. |
12:30
🔗
|
nitro2k01 |
I've tried several times to contact them about it, but... |
12:34
🔗
|
Dec-31-99 |
What happened? |
12:34
🔗
|
nitro2k01 |
When I contacted them? Nothing, of course. |
12:35
🔗
|
Dec-31-99 |
I worked very hard on explaining to archive.org where missing files on archive.org are supposed to go. |
12:35
🔗
|
Dec-31-99 |
On the donkeykongcountry.com defunct site. No reply. Z_Z |
12:36
🔗
|
Dec-31-99 |
I think someday the Archive Team should start their own Web Archive. |
12:37
🔗
|
Dec-31-99 |
But why do they ignore our want for robots.txt to be demolished? |
12:38
🔗
|
nitro2k01 |
Actually, as far as I know IA ignores robots.txt except when IA_Archiver is explicitly disallowed. |
12:39
🔗
|
nitro2k01 |
But many domain nappers put those two fatal lines into robots.txt |
12:39
🔗
|
nitro2k01 |
User-agent: ia_archiver |
12:39
🔗
|
nitro2k01 |
Disallow: / |
12:39
🔗
|
Dec-31-99 |
Or when all robots are disallowed. |
12:39
🔗
|
nitro2k01 |
And boom, the old content is gone from the archive as well. |
12:39
🔗
|
Dec-31-99 |
Like: |
12:39
🔗
|
Dec-31-99 |
User-agent: * |
12:39
🔗
|
Dec-31-99 |
Disallow: / |
12:40
🔗
|
nitro2k01 |
No, I don't think it does. Maybe someone can confirm this. |
12:40
🔗
|
Dec-31-99 |
From another user's link post on IA Forums: http://web.archive.org/web/20070103112847/http://www.infoceptor.com/ |
12:41
🔗
|
Dec-31-99 |
and... http://www.infoceptor.com/robots.txt |
12:41
🔗
|
nitro2k01 |
Ok, you're right. |
12:41
🔗
|
Dec-31-99 |
Robots.txt is a recipe for web annihilation! |
12:42
🔗
|
Dec-31-99 |
Or when only specific directories are blocked for all web crawlers. Like: http://web.archive.org/*/google.com/search |
12:42
🔗
|
Dec-31-99 |
But web.archive.org/*/google.com can be accessed |
12:43
🔗
|
Dec-31-99 |
It's because their robots policy is written to exclude some directories and not all. This is common for many popular sites. |
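The three robots.txt patterns mentioned in this exchange (a rule aimed only at ia_archiver, a blanket rule for all robots, and a rule that blocks a single directory) can be compared with Python's standard `urllib.robotparser`. This is only a minimal sketch of what the rules mean to a compliant crawler: the helper `is_blocked` and the constants `TARGETED`, `BLANKET` and `PARTIAL` are hypothetical names made up for illustration, and nothing here models how the Wayback Machine itself decides to hide existing captures.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str = "ia_archiver", path: str = "/") -> bool:
    """Return True if the given robots.txt text disallows `path` for `user_agent`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Hypothetical sample files matching the patterns discussed in the log.
TARGETED = "User-agent: ia_archiver\nDisallow: /\n"   # only the archive crawler
BLANKET = "User-agent: *\nDisallow: /\n"              # every robot, whole site
PARTIAL = "User-agent: *\nDisallow: /search\n"        # every robot, one directory

if __name__ == "__main__":
    for name, text in (("targeted", TARGETED), ("blanket", BLANKET), ("partial", PARTIAL)):
        print(name, "root blocked:", is_blocked(text),
              "| /search blocked:", is_blocked(text, path="/search"))
```

Under these assumptions, the targeted and blanket files both block ia_archiver from the whole site, while the partial file blocks only paths under /search and leaves the root reachable.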
12:45
🔗
|
Dec-31-99 |
So how are we going to resolve this robots.txt problem? |
12:45
🔗
|
Dec-31-99 |
Every few weeks I cross my fingers that archive.org destroys their robots.txt policy. |
12:46
🔗
|
Dec-31-99 |
But then it doesn't happen!!! *d'ohpalm* |
12:48
🔗
|
Dec-31-99 |
nitro2k01: How are we going to get rid of this issue? I wanted to access nintendo.co.uk's site, but it was excluded entirely per request by site owner... |
12:48
🔗
|
Dec-31-99 |
It's a huge pain to see "Sorry. This url has been excluded from the Wayback Machine." |
12:49
🔗
|
Dec-31-99 |
The webpage actually gives me a 403 Forbidden, rather than a 404 Not Found. I used Live HTTP Headers to find that out. |
12:49
🔗
|
Dec-31-99 |
So it is hidden in their servers, but they won't show it to the public. |
16:31
🔗
|
chazchaz |
Some people are pretty confused. |
16:33
🔗
|
yipdw |
Dec-31-99 should start a website about The IA Conspiracy |
16:46
🔗
|
xmc |
Dec-31-99 should change their name to Dec-31-69 |
17:05
🔗
|
Nemo_bis |
To confuse UNIX time? |
17:06
🔗
|
xmc |
yea |
17:08
🔗
|
Nemo_bis |
I think their current level of confusion is sufficient |
17:28
🔗
|
nitro2k01 |
It's not a conspiracy, just badly implemented policy. (Deleting old content because of a new robots.txt.) |
17:30
🔗
|
chazchaz |
I don't think it's even deleted, it's just not public |
17:30
🔗
|
chazchaz |
or is that not true? |
17:30
🔗
|
yipdw |
it's not deleted |
17:30
🔗
|
nitro2k01 |
Hopefully not deleted, yes. But that matters less to me since it's now inaccessible for the foreseeable future. |
17:30
🔗
|
chazchaz |
Not that there's a huge practical difference, though |
17:31
🔗
|
yipdw |
sufficiently bad policy is indistinguishable from conspiracy |
17:36
🔗
|
Nemo_bis |
hmm |
17:37
🔗
|
Nemo_bis |
So the existence of capital punishment in USA is a conspiracy? |
17:37
🔗
|
yipdw |
sufficiently dry humor is indistinguishable from literary reference |
17:37
🔗
|
yipdw |
also woop woop woop etc. |
17:42
🔗
|
nitro2k01 |
Maybe we could organize a domain buyout as a protest. Buy the domain one by one from the domain nappers for a few hundred dollars or whatever they charge, then revert the robots.txt and hope that the IA Archiver picks it up and makes the archive available again. |
17:42
🔗
|
nitro2k01 |
I'm joking of course, but if I had a domain that used to contain information I wanted very badly, I might've considered doing that. |
17:43
🔗
|
Nemo_bis |
Might be their business model |
18:23
🔗
|
ersi |
Dec-31-99 should.. stfu |
18:23
🔗
|
ersi |
;o |
18:27
🔗
|
ersi |
nitro2k01: It's not deleted, FYI. It's darked. |
18:28
🔗
|
nitro2k01 |
Ok. |
18:28
🔗
|
ersi |
And if you want to keep talking about this bullshit, #archiveteam-bs is where you should go. |
18:28
🔗
|
ersi |
Or to #carebox |
18:30
🔗
|
nitro2k01 |
It's about the availability of archived data, so it's not completely off topic. Then again, discussing it here won't make any difference whatsoever. |
18:32
🔗
|
ersi |
Exactly, so it's completely off-topic. Also, it makes me furiously mad that this stupid subject gets drawn up so much. |
18:34
🔗
|
xmc |
#internetarchive |
18:34
🔗
|
xmc |
I completely agree with your sentiment, ersi |
18:35
🔗
|
nitro2k01 |
Do you think it's a stupid subject because it's off-topic to the channel, or a stupid subject in general? |
18:38
🔗
|
xmc |
it's off-topic to *this* channel, and discussed way out of proportion to how interesting it is |
18:39
🔗
|
ersi |
It's also a stupid subject |
18:40
🔗
|
ersi |
since it's not actually deleted (Ha-HA, you thought IA DELETES things?) but just hidden from people who get their panties in a bunch |
18:43
🔗
|
nitro2k01 |
I'll just wait a few more years and see if they change their policies before they go bankrupt or have a datacenter fire. :p |
18:43
🔗
|
xmc |
who? |
18:43
🔗
|
nitro2k01 |
IA. |
18:44
🔗
|
xmc |
why would you wish for either of those things to happen |
18:44
🔗
|
nitro2k01 |
When did I say I did? |
18:45
🔗
|
ersi |
You kinda indicated you wanted/wished for it, by the way you said it |
18:45
🔗
|
nitro2k01 |
No. |
18:45
🔗
|
ersi |
Try doing what IA does, in the United states of lawsuits. |
18:46
🔗
|
ersi |
It's not going to be all archiving. Or fun.. |
18:46
🔗
|
xmc |
IA is coy about not actually deleting things on robots.txt exactly because it tends to deter lawsuit people |
18:47
🔗
|
ersi |
and that's also why it's not written down specifically |
18:47
🔗
|
nitro2k01 |
One way to resolve it is to ignore robots.txt for historical content, only for domains that now belong to domain nappers. |
18:48
🔗
|
nitro2k01 |
But ok, sure. |
18:49
🔗
|
ersi |
From what I've experienced, it doesn't help to even discuss these things. So let's not continue discussing this. Unless there's something you need help archiving, because it's not available on IA (which would be OK to talk about, even though it's still fucking archived, just that you can't watch it). |
18:49
🔗
|
ersi |
So let's talk about something that doesn't make me ban people, because that'd keep upsetting people. |
18:50
🔗
|
xmc |
I'd support a ban policy for people who push this issue in #archiveteam |
18:50
🔗
|
ersi |
Especially for repeat-offenders. |
18:50
🔗
|
xmc |
"archiveteam: we're not archive.org" |
18:50
🔗
|
nitro2k01 |
/topic |
18:50
🔗
|
ersi |
That's true. |
18:52
🔗
|
xmc |
done |
18:52
🔗
|
nitro2k01 |
So, to discuss something that is on-topic. I asked someone to archive Rocketboom's videos, because they were announcing the videos were going to be deleted. |
18:52
🔗
|
ersi |
damn you efnet |
18:52
🔗
|
nitro2k01 |
If the nick limit wasn't bad enough... |
18:52
🔗
|
xmc |
try removing the thefacebook url |
18:53
🔗
|
nitro2k01 |
What's the general process after something has been archived? Torrent? |
18:53
🔗
|
ersi |
I'll leave it be. It's (FB) popular amongst the kids and what not |
18:53
🔗
|
xmc |
nitro2k01: who did you ask to do it? |
18:53
🔗
|
nitro2k01 |
Let me check my logs. |
18:54
🔗
|
nitro2k01 |
midas: |
18:54
🔗
|
zenguy_pc |
anyone archiving reelradio or whatever service the riaa is targeting? |
18:57
🔗
|
garyrh |
zenguy_pc, there was an archivebot task for grabbing whatever *wasn't* behind a paywall, but i think it was aborted for some reason. |
18:57
🔗
|
xmc |
oh, istr it got stuck |
18:58
🔗
|
garyrh |
<yipdw> 814336k0tl443ilaam07k6u05 failed; reelradio.com can be requeued whenever |
19:00
🔗
|
yipdw |
yeah, it was on a DO node that I ran |
19:00
🔗
|
yipdw |
job got stuck on an empty reply, which is odd |
19:32
🔗
|
Nemo_bis |
s/lengthy\/off-topic/lengthy\/off-topic\/robots.txt/ |
19:35
🔗
|
godane |
so sockington has a wikipedia page: http://en.wikipedia.org/wiki/Sockington |
19:35
🔗
|
xmc |
you mean s,lengthy/off-topic,lengthy/off-topic/robots.txt, |
19:36
🔗
|
xmc |
Nemo_bis: we can just respond with "that topic has been already deemed to be off-topic" |
19:52
🔗
|
garyrh |
perhaps some sort of note about this should be added to the wiki, as the petition Dec-31-99 started explicitly links it |
19:53
🔗
|
garyrh |
s/it/to it/ |