#archiveteam 2012-07-25,Wed


Time Nickname Message
02:16 🔗 z_ http://i.imgur.com/Am19f.jpg
02:16 🔗 kennethre z_: beautiful
02:17 🔗 z_ http://i.imgur.com/zGpiJ.jpg
02:17 🔗 z_ 40Gbps routing, or something like that
02:17 🔗 z_ Ralf Muehlen let me take pictures, lol
02:18 🔗 z_ SketchCow: did you ever visit the SF location
02:19 🔗 z_ http://i.imgur.com/DWF0Z.jpg
02:20 🔗 z_ http://i.imgur.com/AGYuh.jpg
02:21 🔗 shaqfu Three cheers for server porn
02:22 🔗 z_ http://i.imgur.com/CdX4m.jpg
03:17 🔗 Nintendud z_: ohhhh baby. nice pictures.
03:36 🔗 z_ :)
03:39 🔗 chronomex woop woop woop off-topic siren
03:43 🔗 Nintendud Oops
04:06 🔗 Aranje z_:) fucking hotness.
07:23 🔗 SmileyG storage arrays accessed from the back o_O?
07:23 🔗 SmileyG or they accessed both sides?
13:46 🔗 CoJaBo-Az Am I the only one who has noticed/cared that DomainSponsor.com is erasing the Internet Archive?
13:47 🔗 CoJaBo 135,000 individual domains so far, and counting.
14:14 🔗 CoJaBo wonder how many gigabytes that is :/
14:14 🔗 DFJustin well IA does't actually erase it, just disables access
14:14 🔗 DFJustin so they could undo it
14:15 🔗 balrog_ DFJustin: does IA store snapshots of robots.txt files?
14:15 🔗 balrog_ if not, they should (hope SketchCow or anyone who has connections with the people who run IA hears this)
14:16 🔗 DFJustin beats me, I bet they do though
14:16 🔗 balrog_ if they do, they could use that data in cases like this one
14:17 🔗 balrog_ CoJaBo: they're using robots.txt right?
14:17 🔗 CoJaBo Yes; the lines copied from the removal FAQ
14:18 🔗 CoJaBo I'm not sure what the solution would be; I emailed domainsponsor 3 times now, no response.
14:18 🔗 balrog_ also, can someone clarify whether it does really delete?
14:18 🔗 balrog_ imho, deletion should require contacting IA and requesting
14:18 🔗 balrog_ but that's up to the IA administration
14:19 🔗 balrog_ I'm just a bit worried, since while DFJustin says it just disables access, the FAQ says "It will remove documents from your domain from the Wayback Machine."
14:19 🔗 nitro2k01 I'm not really sure how these things work, but from past experience, it seems like a current deny line in robots.txt disables access to all previous snapshots
14:19 🔗 nitro2k01 Maybe
14:19 🔗 balrog_ question is whether it disables or deletes
14:20 🔗 nitro2k01 Which is annoying
14:20 🔗 balrog_ deletion would be bad
14:20 🔗 CoJaBo I posted a thread about it I meant to post:
14:20 🔗 CoJaBo er,
14:20 🔗 balrog_ yeah I found your thread
14:20 🔗 CoJaBo It won't paste the link, that's strange
14:20 🔗 balrog_ http://webdev.archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
14:20 🔗 CoJaBo Yeh
14:22 🔗 nitro2k01 Domainsponsor might even have added those lines in good faith, to keep IA from being clogged by placeholder pages
14:22 🔗 balrog_ why isn't there an option to remove only future ia_archiver crawls?
14:22 🔗 balrog_ but keep old ones
14:22 🔗 balrog_ this is something IA should do
14:22 🔗 CoJaBo balrog_: There is
14:22 🔗 nitro2k01 That shouldn't be an option but the default
14:22 🔗 CoJaBo Just block robots normally.
14:23 🔗 balrog_ yeah, that should be the default
14:23 🔗 balrog_ CoJaBo: hmm?
14:23 🔗 nitro2k01 I DON'T CARE THAT IT'S SUMMER! FIX THIS NAOOO!!1one
14:23 🔗 balrog_ if you block ia_archiver in robots, it will also block old versions of the site
14:23 🔗 CoJaBo The FAQ page clearly states it will remove prior copies tho; also, I've emailed DS several times too and they don't seem to care.
14:24 🔗 CoJaBo I'm not sure where else they'd've found those lines except on the FAQ page
14:25 🔗 Frigolit oh, that is really bad
14:26 🔗 nitro2k01 At least they offer what you need, when you need it
14:26 🔗 CoJaBo I couldn't find stats on what percentage it is, but *every* expired domain I've attempted to access over the past few years has always, always failed with that error.
14:27 🔗 CoJaBo 135,000 is a pretty huge number of domains, and they seem to specifically target those with the most prior traffic (i.e., those that users are most likely to want to read from the archive).
14:30 🔗 CoJaBo .....actually, according to their nameservers, that number appears to have climbed to 2,225,700 now
14:32 🔗 CoJaBo Which is a pretty dramatic increase in just a few months, assuming the previous number wasn't just missing a lot of em....
15:12 🔗 Schbirid does anyone know a good robots.txt parser?
15:13 🔗 Lord_Nigh cat?
15:24 🔗 Schbirid parser
15:27 🔗 balrog_ there's one in python
15:27 🔗 balrog_ http://docs.python.org/library/robotparser.html
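A minimal sketch of the stdlib parser linked above, checking the kind of blanket ia_archiver block under discussion (the robots.txt lines here are hypothetical content fed in directly rather than fetched; the module is `urllib.robotparser` in Python 3, plain `robotparser` in the Python 2 of the era):

```python
import urllib.robotparser  # Python 2 named this module `robotparser`

rp = urllib.robotparser.RobotFileParser()
# Feed rules directly instead of fetching a live robots.txt.
# This stanza mirrors a blanket Wayback-crawler block (hypothetical).
rp.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

print(rp.can_fetch("ia_archiver", "http://example.com/page"))   # blocked
print(rp.can_fetch("SomeOtherBot", "http://example.com/page"))  # no rule applies
```

`can_fetch` answers per-agent, so a rule aimed only at ia_archiver leaves other crawlers unaffected.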
15:33 🔗 ersi_ Ahhh, nagging and wanking about the robots.txt file
15:33 🔗 ersi_ How I love this conversation
15:36 🔗 ersi_ CoJaBo: You know that IA rarely *actually* deletes stuff, right? They often 'dark' things though. ie. the content is obviously still at the IA, but you can't access anonymously
15:41 🔗 CoJaBo ersi_: For all practical intents and purposes, it's gone tho. There's no way to access it, and nobody seems to care.
15:42 🔗 ersi_ Yeah, I certainly don't either. This isn't #internetarchive either though
15:42 🔗 CoJaBo The official forums are all they have, and they're 90% spam
15:42 🔗 ersi Sorry for the strong attitude, but people have been bringing this up - something like, oh.. about all the time.
15:42 🔗 CoJaBo I tried their contact form too, no response there either.
15:43 🔗 ersi There's no fire.
15:43 🔗 ersi There's no smoke.
15:43 🔗 ersi Calm down.
15:43 🔗 CoJaBo And do what?
15:43 🔗 * SmileyG_ still doesn't actually know what the issue is exactly
15:43 🔗 ersi How about something productive? Like.. download something. Archive something?
15:44 🔗 ersi Collect something? Write metadata describing items you've collected? I dunno, endless possibilities
15:44 🔗 CoJaBo SmileyG_: DomainSponsor erasing two million sites from the Internet Archive.
15:44 🔗 SmileyG erasing, or taking offline?
15:44 🔗 nitro2k01 "There's no fire. There's no smoke." I can't visit sites in IA because a domain napper blocked the IA bot. How is that not fire or smoke?
15:44 🔗 CoJaBo SmileyG: From the end user perspective? Erasing.
15:45 🔗 SmileyG CoJaBo: no, in actual fact reality whats happening.
15:45 🔗 CoJaBo nitro2k01: Its still a Royal PITA tho, isn't it?
15:45 🔗 nitro2k01 Yup
15:45 🔗 ersi CoJaBo: This isn't the Internet Archive.
15:46 🔗 CoJaBo Its the closest I could find tho
15:46 🔗 ersi Okay, you want to do something? Start a crawler and save sites to WARC files.
15:46 🔗 CoJaBo Trying to find a group that uses IA, as maybe they'll have more influence.
15:46 🔗 ersi The more the merrier
15:47 🔗 CoJaBo I don't have nearly enough disks or bandwidth to start my own
15:47 🔗 SmileyG pffffffft
15:47 🔗 SmileyG at least my excuse is reasonable
15:47 🔗 ersi I can recommend wget (which has support to write to WARC, if you have a recent version). Or use tef's crawler ( http://github.com/tef/crawler )
15:47 🔗 ersi CoJaBo: Get out
15:47 🔗 SmileyG ersi: plz carry on, maybe you'll give me a hint on what I do next...
15:48 🔗 ersi You know you can always start something?!
15:48 🔗 ersi even if you don't have 2PB of disk
15:48 🔗 * SmileyG wants to save a specific forum he uses
15:48 🔗 ersi SmileyG: I'd say, try to save that then!
15:48 🔗 SmileyG ersi: it sounds like a good plan right?
15:48 🔗 ersi Yeah, totally :-)!
15:49 🔗 SmileyG Except I'm kind of confused about what to do next.
15:49 🔗 SmileyG I have the seesaw and all those tools (the special wget)
15:49 🔗 CoJaBo SmileyG: A little wget magic :P
15:50 🔗 CoJaBo It's trivial if you already know what you want archived lol.. Not so much when you just want to browse the web without running into dead links that really are archived somewhere, but cannot be accessed by anyone -_-'
15:51 🔗 Schbirid ersi: if this is so annoying and bloodboiling for you, make a "FAQ" on the wiki or so?
15:51 🔗 CoJaBo ..actually, I wonder if there's a list of expired domains that DS uses.. That'd actually work if I could just download them before they nuked everything...
15:51 🔗 Schbirid CoJaBo: you can see domain expiry dates in the whois. companies scan that.
15:52 🔗 CoJaBo Cept IA probably doesn't permit crawling their own site, and I can't know a site's expired till it actually is anyway.. gah.
15:53 🔗 SmileyG Schbirid: zomg idea
15:53 🔗 SmileyG "Warrior" project that has a list of "soon expiring" domains
15:53 🔗 SmileyG hand them out to idle warrior clients
15:53 🔗 Schbirid CoJaBo: whois is a domain thing. try it. open a terminal, "whois archive.org" for example
15:53 🔗 Schbirid SmileyG: have fun with 29349817923816948 spam domains per day
15:54 🔗 CoJaBo Schbirid: I know what it is; the problem is getting the ones that DS is going after. To do that, you also need search rankings.
15:54 🔗 Schbirid also, domain expiry has not much to do with a site shutting down
15:54 🔗 Schbirid you could try the alexa toplist or the one from compete (or was it quantcast)
15:55 🔗 SmileyG Schbirid: ok, have them "not" so idle .... display a list of domains due to expire and let them see if they want to grab any....
15:55 🔗 CoJaBo Plus, a good many of them lose their hosting before the expiry; knowing when its going to expire is no good if the host is already gone :/
15:55 🔗 Schbirid SmileyG: you'll be the one to do that :P
15:55 🔗 SmileyG :)
15:56 🔗 SmileyG so how do I go about even thinking about backing up this site?
15:57 🔗 Schbirid what site?
15:57 🔗 SmileyG http://www.gamestm.co.uk/forum/
15:57 🔗 Schbirid yay
15:57 🔗 SmileyG yey?
15:58 🔗 Schbirid please document your thoughts on the wiki, forums are a pain in the ass and it would be great if we started making notes how to mirror them properly
15:58 🔗 SmileyG I have no thoughts yet. I have no *clue* where to start :S
15:58 🔗 Schbirid wget -m -np http://www.gamestm.co.uk/forum/
15:59 🔗 Schbirid -a logfile.log
15:59 🔗 SmileyG looks fun.
16:00 🔗 Schbirid this is my normal starting point "wget -a URL_DATE.log -nv --adjust-extension --convert-links --page-requisites --span-hosts -D domains.that,it.should.span.com -m -np --warc-file=DOMAINURL_DATE URL"
16:00 🔗 Schbirid i usually include a user-agent with my mail address and saying that i want to archive it because i am so nice
16:01 🔗 SmileyG o_O
16:05 🔗 SmileyG well this is working :D
16:05 🔗 SmileyG Once its done, can I easily WARC it? Or should i be doing that from the very start?
16:06 🔗 Schbirid from the very start
16:06 🔗 * SmileyG aborts
16:06 🔗 Schbirid you need a wget that does warc of course
16:07 🔗 SmileyG wget-warc
16:07 🔗 SmileyG :D
16:07 🔗 SmileyG pulling it from mobileme.
16:08 🔗 SmileyG Right
16:08 🔗 SmileyG off home.
16:08 🔗 SmileyG o/
16:16 🔗 ersi SmileyG: You should always do that from the start :)
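For context on why WARC has to happen from the start: a WARC file is just a sequence of records with plain-text headers wrapping the raw HTTP exchange, written as the crawl happens. A hand-rolled sketch of one "response" record follows (format illustration only; real tooling like wget's `--warc-file` handles this properly, and the example URL/payload are stand-ins):

```python
from datetime import datetime
from uuid import uuid4

def warc_response_record(url, http_bytes):
    # Serialize one WARC 1.0 "response" record by hand: a version
    # line, named header fields, a blank line, the captured HTTP
    # bytes, then two CRLFs terminating the record.
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid4(),
        "WARC-Date: %s" % datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: %s" % url,
        "Content-Type: application/http; msgtype=response",
        "Content-Length: %d" % len(http_bytes),  # length of the block only
    ])
    return headers.encode("utf-8") + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

record = warc_response_record(
    "http://www.gamestm.co.uk/forum/",
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi",
)
```

Because the record wraps the raw response bytes, you can't reconstruct it from a plain `wget -m` mirror after the fact — hence "from the very start".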
16:17 🔗 yipdw CoJaBo: have you sent that information to IA staff, or is it just on their forums?
16:18 🔗 yipdw I'd like to know what "removing pages from the Wayback Machine" means before getting winded up about it
16:19 🔗 ersi It doesn't mean anything, he's just being boring
16:19 🔗 yipdw if that means "irrevocably deleting data" then I think there's a case to be made, but if it just means "revoking public access" then, yes, it can be a problem but it's not one that can't be calmly solved
16:19 🔗 ersi It only means that if you input one of the domains which now have a robots.txt - you won't get any results from the Wayback machine.
16:19 🔗 ersi The content is still at IA though
16:19 🔗 yipdw (given the people who work at IA, I would be surprised if it's the former)
16:19 🔗 ersi IA doesn't delete
16:20 🔗 yipdw ersi: that's my expectation. however I'd just like confirmation from IA about that
17:30 🔗 nitro2k01 I'm sure the issue can be calmly resolved, if someone finds a way to actually reach the people at IA
17:31 🔗 nitro2k01 I'm sure it's a policy issue that needs to be thoroughly discussed etc, and maybe it will be calmly resolved in 2018
18:50 🔗 CoJaBo yipdw: I sent it to one of their contact forms; no reply. Its also in their forums.
18:53 🔗 CoJaBo yipdw: The issue, if it isn't clear, is that a large company is buying up expired domains (two million and counting), and automatically requesting removal of all prior content on those domains (content that they clearly have no rights to).
18:53 🔗 CoJaBo The problem is, noone who is actually able to resolve the issue, either at IA or DS, appears to be reachable.
19:00 🔗 yipdw CoJaBo: I got that part; the bit that I'd like to know is whether the data is actually still around
19:00 🔗 yipdw the rest can be worked out in time
19:05 🔗 CoJaBo yipdw: I'd imagine it is still available, due to the chance for erroneous removals, tho its possible they could actually erase it after some grace period (I first posted about it 2 or 3 years ago I think). The FAQs aren't clear, it just says its removed from the Wayback machine.
19:07 🔗 CoJaBo There isn't much point in it still being around tho if there's no way to actually contact either IA or DS; most of the response to other places I've brought it up was very negative ("who cares, the Archive is just copyright infringement anyway")
19:09 🔗 SmileyG i don't understand what buying the domain name has to do with content that was previously hosted on said domain
19:09 🔗 SmileyG infact it has _NOTHING_ to do with it.
19:09 🔗 SmileyG other than the domain won't point to it anymore.
19:14 🔗 CoJaBo SmileyG: Domain expires -> Domain is judged to have commercial value by DomainSponsor's "patented algorithms" -> DomainSponsor's systems automatically register the domain -> DomainSponsor hosts a robots.txt on that domain with the lines, specified in the removal FAQ, to remove all content (including prior content) from the archive -> Regardless of whether or not it was deleted, there is no longer a way for users to access any content
19:15 🔗 SmileyG you misunderstand how domains work
19:15 🔗 SmileyG domain expires, they buy it, they point it to their name servers; the end.
19:16 🔗 SmileyG unless IA scans existing domains and then removes content due to that, which is extremely stupid?
19:18 🔗 CoJaBo .....er, yes.... When IA's crawler revisits the domain, now in possession of DomainSponsor and hosting a "parking page", it sees that robots.txt and disables access to any prior pages on that domain (the parking page, of course, but also that of the prior owner).
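The check the crawler effectively performs on revisit can be sketched as a small function. This is a deliberate simplification — real robots.txt group-matching rules (wildcard agents, longest-match paths) are more involved than this line scan — and the sample inputs are hypothetical:

```python
def blocks_wayback(robots_txt):
    # Return True if this robots.txt carries a blanket block aimed at
    # ia_archiver (the Wayback Machine crawler), i.e. the stanza the
    # removal FAQ prescribes. Simplified: ignores "User-agent: *".
    agent = None
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments/whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = value.lower()
        elif field == "disallow" and agent == "ia_archiver" and value == "/":
            return True
    return False

print(blocks_wayback("User-agent: ia_archiver\nDisallow: /\n"))   # the parked-domain case
print(blocks_wayback("User-agent: *\nDisallow: /cgi-bin/\n"))     # ordinary robots.txt
```

The key point in the exchange above: the rule is evaluated against the *current* robots.txt at crawl time, which is how a new registrant ends up gating access to the previous owner's snapshots.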
19:18 🔗 chronomex IA does not delete data.
19:18 🔗 SmileyG thats a bit stupid :/
19:18 🔗 SmileyG chronomex: yah ok, they block it. but thats errrm retarted
19:18 🔗 chronomex by design.
19:18 🔗 SmileyG I could buy any old domain and block it.
19:19 🔗 chronomex it's stupid, but it's the least stupid of most alternatives
19:19 🔗 SmileyG least stupid : don't do anyhting.
19:19 🔗 CoJaBo chronomex: I'm using "delete" from the end-users' perspective. The data is gone, unless you happen to work at IA.
19:19 🔗 SmileyG whois the domain; oh look its changed, ignore them.
19:19 🔗 SmileyG or "If this is your content, let us know"...
19:19 🔗 Carray is justin bieber going to see selena gomez naked?
19:20 🔗 CoJaBo ...yes, thank you for that visual
19:20 🔗 * CoJaBo barfs
19:20 🔗 SmileyG CoJaBo: ok, its stupid but errrm bugger all you can do about it because its idiotic.
19:28 🔗 ersi Being retarded about Internet Archive really belongs in #archiveteam-bs - the off-topic chat
19:29 🔗 SmileyG And with that I'm going to be quiet again :D
19:29 🔗 ersi Not that strange, no projects really going on.
19:30 🔗 ersi I'm playing around with tef's crawler and hanzotools
23:54 🔗 Carray is justin bieber going to see selena gomez naked?
