Time | Nickname | Message
02:16 | z_ | http://i.imgur.com/Am19f.jpg
02:16 | kennethre | z_: beautiful
02:17 | z_ | http://i.imgur.com/zGpiJ.jpg
02:17 | z_ | 40Gbps routing, or something like that
02:17 | z_ | Ralf Muehlen let me take pictures, lol
02:18 | z_ | SketchCow: did you ever visit the SF location
02:19 | z_ | http://i.imgur.com/DWF0Z.jpg
02:20 | z_ | http://i.imgur.com/AGYuh.jpg
02:21 | shaqfu | Three cheers for server porn
02:22 | z_ | http://i.imgur.com/CdX4m.jpg
03:17 | Nintendud | z_: ohhhh baby. nice pictures.
03:36 | z_ | :)
03:39 | chronomex | woop woop woop off-topic siren
03:43 | Nintendud | Oops
04:06 | Aranje | z_:) fucking hotness.
07:23 | SmileyG | storage arrays accessed from the back o_O?
07:23 | SmileyG | or they accessed both sides?
13:46 | CoJaBo-Az | Am I the only one who has noticed/cared that DomainSponsor.com is erasing the Internet Archive?
13:47 | CoJaBo | 135,000 individual domains so far, and counting.
14:14 | CoJaBo | wonder how many gigabytes that is :/
14:14 | DFJustin | well IA doesn't actually erase it, just disables access
14:14 | DFJustin | so they could undo it
14:15 | balrog_ | DFJustin: does IA store snapshots of robots.txt files?
14:15 | balrog_ | if not, they should (hope SketchCow or anyone who has connections with the people who run IA hears this)
14:16 | DFJustin | beats me, I bet they do though
14:16 | balrog_ | if they do, they could use that data in cases like this one
14:17 | balrog_ | CoJaBo: they're using robots.txt right?
14:17 | CoJaBo | Yes; the lines copied from the removal FAQ
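[Editor's note: the exact lines are not quoted in the log. IA's removal FAQ at the time asked site owners to add a block along these lines to robots.txt, and the Wayback Machine applied it retroactively to already-archived snapshots of the domain as well; this is a sketch for context, not a quote from the FAQ:]

    User-agent: ia_archiver
    Disallow: /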
14:18 | CoJaBo | I'm not sure what the solution would be; I've emailed domainsponsor 3 times now, no response.
14:18 | balrog_ | also, can someone clarify whether it really does delete?
14:18 | balrog_ | imho, deletion should require contacting IA and requesting it
14:18 | balrog_ | but that's up to the IA administration
14:19 | balrog_ | I'm just a bit worried, since while DFJustin says it just disables access, the FAQ says "It will remove documents from your domain from the Wayback Machine."
14:19 | nitro2k01 | I'm not really sure how these things work, but from past experience, it seems like a current deny line in robots.txt disables access to all previous snapshots
14:19 | nitro2k01 | Maybe
14:19 | balrog_ | question is whether it disables or deletes
14:20 | nitro2k01 | Which is annoying
14:20 | balrog_ | deletion would be bad
14:20 | CoJaBo | I posted a thread about it I meant to post:
14:20 | CoJaBo | er,
14:20 | balrog_ | yeah I found your thread
14:20 | CoJaBo | It wont paste the link, thats strange
14:20 | balrog_ | http://webdev.archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
14:20
🔗
|
CoJaBo |
Yeh |
14:22
🔗
|
nitro2k01 |
Domainsponsor might even have added those lines in good faith, to avoid IA from being clogged by placeholder pages |
14:22
🔗
|
balrog_ |
why isn't there an option to remove only future ia_archiver crawls? |
14:22
🔗
|
balrog_ |
but keep old ones |
14:22
🔗
|
balrog_ |
this is something IA should do |
14:22
🔗
|
CoJaBo |
balrog_: There is |
14:22
🔗
|
nitro2k01 |
That shouldn't be an option but the default |
14:22
🔗
|
CoJaBo |
Just block robots normally. |
14:23
🔗
|
balrog_ |
yeah, that should be the default |
14:23
🔗
|
balrog_ |
CoJaBo: hmm? |
14:23
🔗
|
nitro2k01 |
I DON'T CARE THAT IT'S SUMMER! FIX THIS NAOOO!!1one |
14:23
🔗
|
balrog_ |
if you block ia_archiver in robots, it will also block old versions of the site |
14:24 | CoJaBo | The FAQ page clearly states it will remove prior copies tho; also, I've emailed DS several times too and they don't seem to care.
14:25 | CoJaBo | I'm not sure where else they'd've found those lines except on the FAQ page
14:26 | Frigolit | oh, that is really bad
14:26 | nitro2k01 | At least they offer what you need, when you need it
14:26 | CoJaBo | I couldn't find stats on what percentage it is, but *every* expired domain I've attempted to access over the past few years has always, always failed with that error.
14:27 | CoJaBo | 135,000 is a pretty huge number of domains, and they seem to specifically target those with the most prior traffic (i.e., those that users are most likely to want to read from the archive).
14:30 | CoJaBo | .....actually, according to their nameservers, that number appears to have climbed to 2,225,700 now
14:32 | CoJaBo | Which is a pretty dramatic increase in just a few months, assuming the previous number wasn't just missing a lot of em....
15:12 | Schbirid | does anyone know a good robots.txt parser?
15:13 | Lord_Nigh | cat?
15:24 | Schbirid | parser
15:27 | balrog_ | there's one in python
15:27 | balrog_ | http://docs.python.org/library/robotparser.html
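[Editor's note: a minimal sketch of the module balrog_ links, checking whether a robots.txt like the one above would block the Wayback Machine's crawler. The domain is illustrative; the module is "robotparser" in Python 2 and "urllib.robotparser" in Python 3:]

    import urllib.robotparser  # "robotparser" in Python 2

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical parked domain
    rp.read()  # fetch and parse the live robots.txt

    # ia_archiver is the Wayback Machine's crawler; per the discussion above,
    # a "Disallow: /" for it blocks future crawls and playback of old snapshots.
    print(rp.can_fetch("ia_archiver", "http://example.com/"))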
15:33 | ersi_ | Ahhh, nagging and wanking about the robots.txt file
15:33 | ersi_ | How I love this conversation
15:36 | ersi_ | CoJaBo: You know that IA rarely *actually* deletes stuff, right? They often 'dark' things though. ie. the content is obviously still at the IA, but you can't access anonymously
15:41 | CoJaBo | ersi_: For all practical intents and purposes, its gone tho. Theres no way to access it, and nobody seems to care.
15:42 | ersi_ | Yeah, I certainly don't either. This isn't #internetarchive either though
15:42 | CoJaBo | The official forums are all they have, and they're 90% spam
15:42 | ersi | Sorry for the strong attitude, but people have been bringing this up - something like, oh.. about all the time.
15:43 | CoJaBo | I tried their contact form too, no response there either.
15:43 | ersi | There's no fire.
15:43 | ersi | There's no smoke.
15:43 | ersi | Calm down.
15:43 | CoJaBo | And do what?
15:43 | * | SmileyG_ still doesn't actually know what the issue is exactly
15:44 | ersi | How about something productive? Like.. download something. Archive something?
15:44 | ersi | Collect something? Write metadata describing items you've collected? I dunno, endless possibilities
15:44 | CoJaBo | SmileyG_: DomainSponsor erasing two million sites from the Internet Archive.
15:44 | SmileyG | erasing, or taking offline?
15:44 | nitro2k01 | "There's no fire. There's no smoke." I can't visit sites in IA because a domain napper blocked the IA bot. How is that not fire or smoke?
15:45 | CoJaBo | SmileyG: From the end user perspective? Erasing.
15:45 | SmileyG | CoJaBo: no, what's actually happening, in reality.
15:45 | CoJaBo | nitro2k01: Its still a Royal PITA tho, isn't it?
15:45 | nitro2k01 | Yup
15:46 | ersi | CoJaBo: This isn't the Internet Archive.
15:46 | CoJaBo | Its the closest I could find tho
15:46 | ersi | Okay, you want to do something? Start a crawler and save sites to WARC files.
15:46 | CoJaBo | Trying to find a group that uses IA, as maybe they'll have more influence.
15:47 | ersi | The more the merrier
15:47 | CoJaBo | I don't have nearly enough disks or bandwidth to start my own
15:47 | SmileyG | pffffffft
15:47 | SmileyG | at least my excuse is reasonable
15:47 | ersi | I can recommend wget (which has support to write to WARC, if you have a recent version). Or use tef's crawler ( http://github.com/tef/crawler )
15:47 | ersi | CoJaBo: Get out
15:48 | SmileyG | ersi: plz carry on, maybe you'll give me a hint on what I do next...
15:48 | ersi | You know you can always start something?!
15:48 | ersi | even if you don't have 2PB of disk
15:48 | * | SmileyG wants to save a specific forum he uses
15:48 | ersi | SmileyG: I'd say, try to save that then!
15:48 | SmileyG | ersi: it sounds like a good plan right?
15:49 | ersi | Yeah, totally :-)!
15:49 | SmileyG | Except I'm kind of confused about what to do next.
15:49 | SmileyG | I have the seesaw and all those tools (the special wget)
15:50 | CoJaBo | SmileyG: A little wget magic :P
15:51 | CoJaBo | Its trivial if you already know what you want archived lol.. Not so much when you just want to browse the web without running into dead links that really are archived somewhere, but cannot be accessed by anyone -_-'
15:51 | Schbirid | ersi: if this is so annoying and bloodboiling for you, make a "FAQ" on the wiki or so?
15:51 | CoJaBo | ..actually, I wonder if theres a list of expired domains that DS uses.. That'd actually work if I could just download them before they nuked everything...
15:52 | Schbirid | CoJaBo: you can see domain expiry dates in the whois. companies scan that.
15:53 | CoJaBo | Cept IA probably doesn't permit crawling their own site, and I can't know a sites expired till it actually is anyway.. gah.
15:53 | SmileyG | Schbirid: zomg idea
15:53 | SmileyG | "Warrior" project that has a list of "soon expiring" domains
15:53 | SmileyG | hand them out to idle warrior clients
15:53 | Schbirid | CoJaBo: whois is a domain thing. try it. open a terminal, "whois archive.org" for example
15:54 | Schbirid | SmileyG: have fun with 29349817923816948 spam domains per day
15:54 | CoJaBo | Schbirid: I know what it is; the problem is getting the ones that DS is going after. To do that, you also need search rankings.
15:54 | Schbirid | also, domain expiry has not much to do with a site shutting down
15:55 | Schbirid | you could try the alexa toplist or the one from compete (or was it quantcast)
15:55 | SmileyG | Schbirid: ok, have them "not" so idle .... display a list of domains due to expire and let them see if they want to grab any....
15:55 | CoJaBo | Plus, a good many of them lose their hosting before the expiry; knowing when its going to expire is no good if the host is already gone :/
15:55 | Schbirid | SmileyG: you'll be the one to do that :P
15:56 | SmileyG | :)
15:57 | SmileyG | so how do I go about even thinking about backing up this site?
15:57 | Schbirid | what site?
15:57 | SmileyG | http://www.gamestm.co.uk/forum/
15:57 | Schbirid | yay
15:58 | SmileyG | yey?
15:58 | Schbirid | please document your thoughts on the wiki, forums are a pain in the ass and it would be great if we started making notes how to mirror them properly
15:58 | SmileyG | I have no thoughts yet. I have no *clue* where to start :S
15:59 | Schbirid | wget -m -np http://www.gamestm.co.uk/forum/
15:59 | Schbirid | -a logfile.log
16:00 | SmileyG | looks fun.
16:00 | Schbirid | this is my normal starting point "wget -a URL_DATE.log -nv --adjust-extension --convert-links --page-requisites --span-hosts -D domains.that,it.should.span.com -m -np --warc-file=DOMAINURL_DATE URL"
16:01 | Schbirid | i usually include a user-agent with my mail address and saying that i want to archive it because i am so nice
16:05 | SmileyG | o_O
16:05 | SmileyG | well this is working :D
16:06 | SmileyG | Once its done, can I easily WARC it? Or should i be doing that from the very start?
16:06 | Schbirid | from the very start
16:06 | * | SmileyG aborts
16:07 | Schbirid | you need a wget that does warc of course
16:07 | SmileyG | wget-warc
16:07 | SmileyG | :D
16:08 | SmileyG | pulling it from mobileme.
16:08 | SmileyG | Right
16:08 | SmileyG | off home.
16:08 | SmileyG | o/
16:16 | ersi | SmileyG: You should always do that from the start :)
16:17 | yipdw | CoJaBo: have you sent that information to IA staff, or is it just on their forums?
16:18 | yipdw | I'd like to know what "removing pages from the Wayback Machine" means before getting wound up about it
16:19 | ersi | It doesn't mean anything, he's just being boring
16:19 | yipdw | if that means "irrevocably deleting data" then I think there's a case to be made, but if it just means "revoking public access" then, yes, it can be a problem but it's not one that can't be calmly solved
16:19 | ersi | It only means that if you input one of the domains which now have a robots.txt - you won't get any results from the Wayback machine.
16:19 | ersi | The content is still at IA though
16:19 | yipdw | (given the people who work at IA, I would be surprised if it's the former)
16:19 | ersi | IA doesn't delete
16:20 | yipdw | ersi: that's my expectation. however I'd just like confirmation from IA about that
17:30 | nitro2k01 | I'm sure the issue can be calmly resolved, if someone finds a way to actually reach the people at IA
17:31 | nitro2k01 | I'm sure it's a policy issue that needs to be thoroughly discussed etc, and maybe it will be calmly resolved in 2018
18:50 | CoJaBo | yipdw: I sent it to one of their contact forms; no reply. Its also in their forums.
18:53 | CoJaBo | yipdw: The issue, if it isn't clear, is that a large company is buying up expired domains (two million and counting), and automatically requesting removal of all prior content on those domains (content that they clearly have no rights to).
18:53 | CoJaBo | The problem is, noone who is actually able to resolve the issue, either at IA or DS, appears to be reachable.
19:00 | yipdw | CoJaBo: I got that part; the bit that I'd like to know is whether the data is actually still around
19:00 | yipdw | the rest can be worked out in time
19:05 | CoJaBo | yipdw: I'd imagine it is still available, due to the chance for erroneous removals, tho its possible they could actually erase it after some grace period (I first posted about it 2 or 3 years ago I think). The FAQs aren't clear; they just say its removed from the Wayback machine.
19:07 | CoJaBo | There isn't much point in it still being around tho if theres no way to actually contact either IA or DS; most of the responses in the other places I've brought it up were very negative ("who cares, the Archive is just copyright infringement anyway")
19:09 | SmileyG | i don't understand what buying the domain name has to do with content that was previously hosted on said domain
19:09 | SmileyG | in fact it has _NOTHING_ to do with it.
19:09 | SmileyG | other than the domain won't point to it anymore.
19:14 | CoJaBo | SmileyG: Domain expires -> Domain is judged to have commercial value by DomainSponsor's "patented algorithms" -> DomainSponsor's systems automatically register the domain -> DomainSponsor hosts a robots.txt on that domain with the lines, specified in the removal FAQ, to remove all content (including prior content) from the archive -> Regardless of whether or not it was deleted, there is no longer a way for users to access any content
19:15 | SmileyG | you misunderstand how domains work
19:15 | SmileyG | domain expires, they buy it, they point it to their name servers; the end.
19:16 | SmileyG | unless IA scans existing domains and then removes content due to that, which is extremely stupid?
19:18 | CoJaBo | .....er, yes.... When IA's crawler revisits the domain, now in possession of DomainSponsor and hosting a "parking page", it sees that robots.txt and disables access to any prior pages on that domain (the parking page, of course, but also that of the prior owner).
19:18 | chronomex | IA does not delete data.
19:18 | SmileyG | thats a bit stupid :/
19:18 | SmileyG | chronomex: yah ok, they block it. but thats errrm retarded
19:18 | chronomex | by design.
19:19 | SmileyG | I could buy any old domain and block it.
19:19 | chronomex | it's stupid, but it's the least stupid of most alternatives
19:19 | SmileyG | least stupid: don't do anything.
19:19 | CoJaBo | chronomex: I'm using "delete" from the end-users' perspective. The data is gone, unless you happen to work at IA.
19:19 | SmileyG | whois the domain; oh look its changed, ignore them.
19:19 | SmileyG | or "If this is your content, let us know"...
19:20 | Carray | is justin bieber going to see selena gomez naked?
19:20 | CoJaBo | ...yes, thank you for that visual
19:20 | * | CoJaBo barfs
19:28 | SmileyG | CoJaBo: ok, its stupid but errrm bugger all you can do about it because its idiotic.
19:29 | ersi | Being retarded about Internet Archive really belongs in #archiveteam-bs - the off-topic chat
19:29 | SmileyG | And with that I'm going to be quiet again :D
19:30 | ersi | Not that strange, no projects really going on.
19:30 | ersi | I'm playing around with tef's crawler and hanzotools
23:54 | Carray | is justin bieber going to see selena gomez naked?