[02:16] http://i.imgur.com/Am19f.jpg
[02:16] z_: beautiful
[02:17] http://i.imgur.com/zGpiJ.jpg
[02:17] 40Gbps routing, or something like that
[02:17] Ralf Muehlen let me take pictures, lol
[02:18] SketchCow: did you ever visit the SF location
[02:19] http://i.imgur.com/DWF0Z.jpg
[02:20] http://i.imgur.com/AGYuh.jpg
[02:21] Three cheers for server porn
[02:22] http://i.imgur.com/CdX4m.jpg
[03:17] z_: ohhhh baby. nice pictures.
[03:36] :)
[03:39] woop woop woop off-topic siren
[03:43] Oops
[04:06] z_:) fucking hotness.
[07:23] storage arrays accessed from the back o_O?
[07:23] or are they accessed from both sides?
[13:46] Am I the only one who has noticed/cared that DomainSponsor.com is erasing the Internet Archive?
[13:47] 135,000 individual domains so far, and counting.
[14:14] wonder how many gigabytes that is :/
[14:14] well IA doesn't actually erase it, just disables access
[14:14] so they could undo it
[14:15] DFJustin: does IA store snapshots of robots.txt files?
[14:15] if not, they should (hope SketchCow or anyone who has connections with the people who run IA hears this)
[14:16] beats me, I bet they do though
[14:16] if they do, they could use that data in cases like this one
[14:17] CoJaBo: they're using robots.txt right?
[14:17] Yes; the lines copied from the removal FAQ
[14:18] I'm not sure what the solution would be; I've emailed DomainSponsor 3 times now, no response.
[14:18] also, can someone clarify whether it really deletes?
[14:18] imho, deletion should require contacting IA and requesting it
[14:18] but that's up to the IA administration
[14:19] I'm just a bit worried, since while DFJustin says it just disables access, the FAQ says "It will remove documents from your domain from the Wayback Machine."
[14:19] I'm not really sure how these things work, but from past experience, it seems like a current deny line in robots.txt disables access to all previous snapshots
[14:19] Maybe
[14:19] question is whether it disables or deletes
[14:20] Which is annoying
[14:20] deletion would be bad
[14:20] I posted a thread about it I meant to post:
[14:20] er,
[14:20] yeah I found your thread
[14:20] It won't paste the link, that's strange
[14:20] http://webdev.archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
[14:20] Yeh
[14:22] DomainSponsor might even have added those lines in good faith, to keep IA from being clogged with placeholder pages
[14:22] why isn't there an option to remove only future ia_archiver crawls?
[14:22] but keep old ones
[14:22] this is something IA should do
[14:22] balrog_: There is
[14:22] That shouldn't be an option but the default
[14:22] Just block robots normally.
[14:23] yeah, that should be the default
[14:23] CoJaBo: hmm?
[14:23] I DON'T CARE THAT IT'S SUMMER! FIX THIS NAOOO!!1one
[14:23] if you block ia_archiver in robots.txt, it will also block old versions of the site
[14:23] The FAQ page clearly states it will remove prior copies tho; also, I've emailed DS several times too and they don't seem to care.
[14:24] I'm not sure where else they'd've found those lines except on the FAQ page
[14:25] oh, that is really bad
[14:26] At least they offer what you need, when you need it
[14:26] I couldn't find stats on what percentage it is, but *every* expired domain I've attempted to access over the past few years has always, always failed with that error.
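For context, the "lines copied from the removal FAQ" being discussed here are presumably the standard Wayback Machine exclusion directives that IA's removal FAQ asked site owners to serve in robots.txt, roughly the following (a sketch of what a parked domain would serve, not a copy of DomainSponsor's actual file):

    User-agent: ia_archiver
    Disallow: /

A blanket Disallow for ia_archiver on the current domain is what makes the Wayback Machine stop serving the older snapshots of that domain as well, which is the behaviour described above.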
[14:27] 135,000 is a pretty huge number of domains, and they seem to specifically target those with the most prior traffic (i.e., those that users are most likely to want to read from the archive).
[14:30] .....actually, according to their nameservers, that number appears to have climbed to 2,225,700 now
[14:32] Which is a pretty dramatic increase in just a few months, assuming the previous number wasn't just missing a lot of them....
[15:12] does anyone know a good robots.txt parser?
[15:13] cat?
[15:24] parser
[15:27] there's one in python
[15:27] http://docs.python.org/library/robotparser.html
[15:33] Ahhh, nagging and wanking about the robots.txt file
[15:33] How I love this conversation
[15:36] CoJaBo: You know that IA rarely *actually* deletes stuff, right? They often 'dark' things though. i.e. the content is obviously still at the IA, but you can't access it anonymously
[15:41] ersi_: For all practical intents and purposes, it's gone tho. There's no way to access it, and nobody seems to care.
[15:42] Yeah, I certainly don't either. This isn't #internetarchive either though
[15:42] The official forums are all they have, and they're 90% spam
[15:42] Sorry for the strong attitude, but people have been bringing this up - something like, oh.. about all the time.
[15:42] I tried their contact form too, no response there either.
[15:43] There's no fire.
[15:43] There's no smoke.
[15:43] Calm down.
[15:43] And do what?
[15:43] * SmileyG_ still doesn't actually know what the issue is exactly
[15:43] How about something productive? Like.. download something. Archive something?
[15:44] Collect something? Write metadata describing items you've collected? I dunno, endless possibilities
[15:44] SmileyG_: DomainSponsor erasing two million sites from the Internet Archive.
[15:44] erasing, or taking offline?
[15:44] "There's no fire. There's no smoke." I can't visit sites in IA because a domain napper blocked the IA bot. How is that not fire or smoke?
[15:44] SmileyG: From the end user's perspective? Erasing.
[15:45] CoJaBo: no, what's actually happening in reality.
[15:45] nitro2k01: It's still a royal PITA tho, isn't it?
[15:45] Yup
[15:45] CoJaBo: This isn't the Internet Archive.
[15:46] It's the closest I could find tho
[15:46] Okay, you want to do something? Start a crawler and save sites to WARC files.
[15:46] Trying to find a group that uses IA, as maybe they'll have more influence.
[15:46] The more the merrier
[15:47] I don't have nearly enough disks or bandwidth to start my own
[15:47] pffffffft
[15:47] at least my excuse is reasonable
[15:47] I can recommend wget (which has support to write to WARC, if you have a recent version). Or use tef's crawler ( http://github.com/tef/crawler )
[15:47] CoJaBo: Get out
[15:47] ersi: plz carry on, maybe you'll give me a hint on what I do next...
[15:48] You know you can always start something?!
[15:48] even if you don't have 2PB of disk
[15:48] * SmileyG wants to save a specific forum he uses
[15:48] SmileyG: I'd say, try to save that then!
[15:48] ersi: it sounds like a good plan right?
[15:48] Yeah, totally :-)!
[15:49] Except I'm kind of confused about what to do next.
[15:49] I have the seesaw and all those tools (the special wget)
[15:49] SmileyG: A little wget magic :P
[15:50] It's trivial if you already know what you want archived lol.. Not so much when you just want to browse the web without running into dead links that really are archived somewhere, but cannot be accessed by anyone -_-'
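On the robots.txt parser question above, the [15:27] link points at the module bundled with Python. A minimal sketch of checking whether a given robots.txt shuts out the Wayback crawler follows; the robots.txt lines and URL are made-up examples, and in Python 3 the module is named urllib.robotparser rather than robotparser:

    # A minimal check of whether a robots.txt blocks the Wayback Machine's crawler.
    # Assumes Python 3; the module linked above under the Python 2 name "robotparser"
    # lives at urllib.robotparser in Python 3.
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt lines of a parked domain, not DomainSponsor's actual file.
    parked_robots_txt = [
        "User-agent: ia_archiver",
        "Disallow: /",
    ]

    rp = RobotFileParser()
    rp.parse(parked_robots_txt)

    # Prints False: ia_archiver may not fetch anything on the domain, which is
    # exactly the condition that makes the Wayback Machine hide the old snapshots.
    print(rp.can_fetch("ia_archiver", "http://example.com/some/old/page.html"))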
[15:51] ersi: if this is so annoying and blood-boiling for you, make a "FAQ" on the wiki or so?
[15:51] ..actually, I wonder if there's a list of expired domains that DS uses.. That'd actually work if I could just download them before they nuked everything...
[15:51] CoJaBo: you can see domain expiry dates in the whois. companies scan that.
[15:52] Cept IA probably doesn't permit crawling their own site, and I can't know a site's expired till it actually is anyway.. gah.
[15:53] Schbirid: zomg idea
[15:53] "Warrior" project that has a list of "soon expiring" domains
[15:53] hand them out to idle warrior clients
[15:53] CoJaBo: whois is a domain thing. try it. open a terminal, "whois archive.org" for example
[15:53] SmileyG: have fun with 29349817923816948 spam domains per day
[15:54] Schbirid: I know what it is; the problem is getting the ones that DS is going after. To do that, you also need search rankings.
[15:54] also, domain expiry doesn't have much to do with a site shutting down
[15:54] you could try the Alexa toplist or the one from Compete (or was it Quantcast)
[15:55] Schbirid: ok, have them "not" so idle .... display a list of domains due to expire and let them see if they want to grab any....
[15:55] Plus, a good many of them lose their hosting before the expiry; knowing when it's going to expire is no good if the host is already gone :/
[15:55] SmileyG: you'll be the one to do that :P
[15:55] :)
[15:56] so how do I go about even thinking about backing up this site?
[15:57] what site?
[15:57] http://www.gamestm.co.uk/forum/
[15:57] yay
[15:57] yey?
[15:58] please document your thoughts on the wiki, forums are a pain in the ass and it would be great if we started making notes on how to mirror them properly
[15:58] I have no thoughts yet. I have no *clue* where to start :S
[15:58] wget -m -np http://www.gamestm.co.uk/forum/
[15:59] -a logfile.log
[15:59] looks fun.
[16:00] this is my normal starting point "wget -a URL_DATE.log -nv --adjust-extension --convert-links --page-requisites --span-hosts -D domains.that,it.should.span.com -m -np --warc-file=DOMAINURL_DATE URL"
[16:00] i usually include a user-agent with my mail address, saying that i want to archive it because i am so nice
[16:01] o_O
[16:05] well this is working :D
[16:05] Once it's done, can I easily WARC it? Or should I be doing that from the very start?
[16:06] from the very start
[16:06] * SmileyG aborts
[16:06] you need a wget that does WARC of course
[16:07] wget-warc
[16:07] :D
[16:07] pulling it from mobileme.
[16:08] Right
[16:08] off home.
[16:08] o/
[16:16] SmileyG: You should always do that from the start :)
[16:17] CoJaBo: have you sent that information to IA staff, or is it just on their forums?
[16:18] I'd like to know what "removing pages from the Wayback Machine" means before getting wound up about it
[16:19] It doesn't mean anything, he's just being boring
[16:19] if that means "irrevocably deleting data" then I think there's a case to be made, but if it just means "revoking public access" then, yes, it can be a problem but it's not one that can't be calmly solved
[16:19] It only means that if you input one of the domains which now have a robots.txt - you won't get any results from the Wayback Machine.
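To unpack the [16:00] "normal starting point" above, here is the same wget invocation with each flag annotated. It is only a sketch of the command quoted in the log: the log name, WARC name, -D domain list, and URL are the placeholders from the original message, and --warc-file requires a WARC-capable wget (1.14 or newer, or the wget-warc build mentioned at [16:07]):

    # -a URL_DATE.log        append all output to a log file
    # -nv                    less verbose output
    # --adjust-extension     save pages with a proper .html extension
    # --convert-links        rewrite links so the local mirror is browsable offline
    # --page-requisites      also fetch the images/CSS needed to render each page
    # --span-hosts + -D ...  follow links onto other hosts, but only the listed ones
    # -m                     mirror mode (recursive, infinite depth, timestamping)
    # -np                    never ascend above the starting directory
    # --warc-file=NAME       additionally record the whole crawl into NAME.warc.gz
    wget -a URL_DATE.log -nv --adjust-extension --convert-links --page-requisites \
         --span-hosts -D domains.that,it.should.span.com -m -np \
         --warc-file=DOMAINURL_DATE URL

Adding something like --user-agent="archiving this site, contact me@example.com" (a made-up example address) covers the [16:00] note about identifying yourself in the user agent.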
[16:19] The content is still at IA though
[16:19] (given the people who work at IA, I would be surprised if it's the former)
[16:19] IA doesn't delete
[16:20] ersi: that's my expectation. However, I'd just like confirmation from IA about that
[17:30] I'm sure the issue can be calmly resolved, if someone finds a way to actually reach the people at IA
[17:31] I'm sure it's a policy issue that needs to be thoroughly discussed etc, and maybe it will be calmly resolved in 2018
[18:50] yipdw: I sent it to one of their contact forms; no reply. It's also in their forums.
[18:53] yipdw: The issue, if it isn't clear, is that a large company is buying up expired domains (two million and counting), and automatically requesting removal of all prior content on those domains (content that they clearly have no rights to).
[18:53] The problem is, no one who is actually able to resolve the issue, either at IA or DS, appears to be reachable.
[19:00] CoJaBo: I got that part; the bit that I'd like to know is whether the data is actually still around
[19:00] the rest can be worked out in time
[19:05] yipdw: I'd imagine it is still available, due to the chance of erroneous removals, tho it's possible they could actually erase it after some grace period (I first posted about it 2 or 3 years ago I think). The FAQs aren't clear; they just say it's removed from the Wayback Machine.
[19:07] There isn't much point in it still being around tho if there's no way to actually contact either IA or DS; most of the responses in other places I've brought it up have been very negative ("who cares, the Archive is just copyright infringement anyway")
[19:09] i don't understand what buying the domain name has to do with content that was previously hosted on said domain
[19:09] in fact it has _NOTHING_ to do with it.
[19:09] other than the domain won't point to it anymore.
[19:14] SmileyG: Domain expires -> Domain is judged to have commercial value by DomainSponsor's "patented algorithms" -> DomainSponsor's systems automatically register the domain -> DomainSponsor hosts a robots.txt on that domain with the lines, specified in the removal FAQ, to remove all content (including prior content) from the archive -> Regardless of whether or not it was deleted, there is no longer a way for users to access any content
[19:15] you misunderstand how domains work
[19:15] domain expires, they buy it, they point it to their nameservers; the end.
[19:16] unless IA scans existing domains and then removes content due to that, which is extremely stupid?
[19:18] .....er, yes.... When IA's crawler revisits the domain, now in possession of DomainSponsor and hosting a "parking page", it sees that robots.txt and disables access to any prior pages on that domain (the parking page, of course, but also those of the prior owner).
[19:18] IA does not delete data.
[19:18] that's a bit stupid :/
[19:18] chronomex: yah ok, they block it. but that's errrm retarded
[19:18] by design.
[19:18] I could buy any old domain and block it.
[19:19] it's stupid, but it's the least stupid of the alternatives
[19:19] least stupid: don't do anything.
[19:19] chronomex: I'm using "delete" from the end-users' perspective. The data is gone, unless you happen to work at IA.
[19:19] whois the domain; oh look, it's changed; ignore them.
[19:19] or "If this is your content, let us know"...
[19:19] is justin bieber going to see selena gomez naked?
[19:20] ...yes, thank you for that visual
[19:20] * CoJaBo barfs
[19:20] CoJaBo: ok, it's stupid but errrm bugger all you can do about it because it's idiotic.
[19:28] Being retarded about the Internet Archive really belongs in #archiveteam-bs - the off-topic chat
[19:29] And with that I'm going to be quiet again :D
[19:29] Not that strange, no projects really going on.
[19:30] I'm playing around with tef's crawler and hanzotools
[23:54] is justin bieber going to see selena gomez naked?