[00:03] [6:59:58 PM] Kenji Nagahashi (Internet Archive): poe-news.com and poetv.com have been blocking crawler since 14th 11AM UTC.
[00:03] [7:00:31 PM] Jason Scott: YOU SHALL BROWSE US, NEVERMORE
[00:03] [7:03:17 PM] Jason Scott: THERE I WAS, GENTLY WEBSERVING / WHEN I HEARD A SOUND UNNERVING / THE SOUND OF CRAWLERS MANY AS THEY CAME UPON MY DOOR
[00:04] [7:03:50 PM] Jason Scott: WAS THAT SOUND KENJI LAUGHING / AS MY DOORS THEY KEPT ON RAPPING / EVER RAPPING FROM THE CRAWLERS KNOCKING ON MY SERVER'S DOOR
[00:04] [7:04:08 PM] Jason Scott: CAME A VOICE THEN: "404."
[00:17] SketchCow: was my peonews targz acceptable
[00:17] the files and warc were inside it
[00:18] Haven't looked yet.
[06:23] Oh yes, Splinder. The tracker says we're done, but I finally ran the check-dld.sh and found about a thousand. Should I run them over again?
[06:35] And the python script that verifies. How should I use/interpret the results from that? Should I run dld-streamer on <(grep "Error in" verify.log |cut -d" " -f4|awk -F"/" '{ print $2":"$6 }')?
[18:19] goddamn, OpenSSL has the worst error messages ever
[18:20] like "error 20 at 0 depth lookup:unable to get local issuer certificate"
[18:21] a far better error message would indicate which part of the certificate chain verification failed
[18:21] with human-readable text, i.e. the name of the issuer
[18:24] #WikiTeam is recruiting, we need help archiving zillion wikis http://groups.google.com/group/wikiteam-discuss/browse_thread/thread/2de4428e60fc64f5
[18:27] yipdw: ssl is wondeful #lolol
[18:27] *wonderful
[18:28] emijrp: I'd love to help out
[18:30] kennethre: I ran into a really strange problem with OpenSSL verification that led me to this
[18:30] namely, I've some workers that have AMQP subscriptions that just stopped responding until I kicked them
[18:30] and then I got a flood of OpenSSL verification errors on their logs
[18:30] no idea if the problems are related
[18:31] kennethre: great, read the instructions, and ask me if needed, you can start with a small wiki (just some thousands pages)
[18:32] ew, urllib2
[19:33] does anyone… archive dns?
[19:34] with this whole SOPA thing, would be interesting to have a snapshot of the top n site's records
[19:34] jic
[20:06] kennethre: I'm wondering how you'd get started with such a thing: are you thinking of running e.g. dig on the Alexa Top 500?
[20:06] it's not hard to get a list of all .com
[20:06] because it seems to me that the sites most affected aren't going to be the Top 500
[20:06] it's going to be the small .orgs, .nets, etc
[20:06] chronomex: how would you get that list?
[20:07] yipdw: absolutely
[20:07] I'm not sure how you'd get that list
[20:07] efficiently ;P
[20:07] once you have it, it wouldn't be hard to stick them all in a database
[20:07] I wonder if there's a way to scrape registries
[20:08] I wonder if opendns has any open data
[20:09] yipdw: http://www.verisigninc.com/en_US/products-and-services/domain-name-services/grow-your-domain-name-business/analyze/tld-zone-access/index.xhtml
[20:09] oh
[20:09] yeah.
[20:09] it's that hard.
[20:09] that's a way, I guess
[20:09] you need a static ip and some paperwork.
[20:09] it's really complicated
[20:09] blah
[20:10] why would you need access?
[20:10] that's not really complicated, but it's annoying
[20:10] I don't think "we want to archive DNS in case of SOPA apocalypse" is something they're going to approve
[20:10] During the term of this Agreement, you may use the data for any legal purpose, not prohibited under Section 4 below.
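
A minimal sketch of the "run dig over a domain list" idea floated above, assuming a one-domain-per-line domains.txt, a snapshot file, and a handful of record types (all assumptions for illustration, not anything settled in the channel). It shells out to dig so the results stay in the zone-file-style form discussed below:

    #!/usr/bin/env python
    # Sketch: snapshot DNS records for a list of domains by shelling out to dig.
    # Assumed inputs/outputs (not from the log): domains.txt, dns-snapshot.txt,
    # and a small set of record types.
    import subprocess

    RECORD_TYPES = ["A", "AAAA", "NS", "MX"]

    with open("domains.txt") as inf, open("dns-snapshot.txt", "wb") as outf:
        for domain in (line.strip() for line in inf):
            if not domain:
                continue
            for rtype in RECORD_TYPES:
                # "+noall +answer" limits dig's output to the answer section,
                # i.e. lines like "example.com. 86400 IN A 93.184.216.34".
                out = subprocess.Popen(
                    ["dig", "+noall", "+answer", domain, rtype],
                    stdout=subprocess.PIPE).communicate()[0]
                outf.write(out)

The per-domain answer sections could just as well be stuffed into a database, as suggested above; plain text keeps the snapshot trivially diffable.
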
[20:10] oh
[20:11] its possible to archive DNS?
[20:11] of course
[20:11] wouldnt that just be whackamole?
[20:11] could be fun :)
[20:11] it's no more whack-a-mole than archiving a live site
[20:11] yeah but how big would it be, gb wise
[20:11] the tricky thing is reaching all the names underneath the zonefile
[20:12] many
[20:12] bsmith094: .com .net is about 7G
[20:12] depends on how many sites you store
[20:12] chronomex: where'd you get that figure from?
[20:12] serp snippet that showed up when I was looking for the zonefile itself
[20:12] huh, interesting
[20:12] wait, would it just be blah.com >> 159.254.222.1, a bunch of these in text?
[20:12] www.spyrush.com/tld/ (if it loads)
[20:12] http://www.pir.org/help/access
[20:13] bsmith094: something like that; I'd prefer dig-style output
[20:13] makes it easier to reconstruct records for DNS servers
[20:13] wonder if you can download the whois database
[20:13] yes, i think somebody here tried
[20:13] e.g. www.l.google.com 205 IN A 74.125.225.50
[20:13] etc
[20:15] well, hmm
[20:15] the longest domain name you can have is what, 249 characters for a 3-character TLD
[20:15] Whois API offers the entire whois database download in major GTLDs (.com, .net, .org, .us, .biz, .mobi, etc)
[20:15] What's more, it provides archived historic whois database in both parsed and raw format for download as CSV files. I used a partial database download for a company project (SEO related) and the data quality was pretty good.
[20:16] (249 * |LDH|)!
[20:16] totally doable given multiple universes
[20:16] sorry for the double post, that was from a search i ran http://www.whoisxmlapi.com/whois-database-download.php
[20:16] 63+63+63
[20:16] ldh?
[20:16] yipdw: or quantum internet protocol
[20:16] bsmith094: letter-digit-hyphen
[20:16] yeah
[20:17] OMFG thats a fricken huge number?!?!?
[20:17] if im reading that notation right
[20:17] yes, it is a fricken huge number
[20:17] I wasn't seriously considering exhaustive search
[20:17] heh
[20:18] the complete whois database, according to the link i gave, is 125 million records
[20:21] well, there's no way to dump a whois database via the whois protocol
[20:21] according to RFC 3912
[20:22] yeah, the whois protocol isn't designed to allow you to spam me, for all values of me
[20:22] why not, that seems semi reasonable
[20:22] because there just isn't a way to do it
[20:22] I was going to start at a list of TLDs
[20:23] which (for now) is limited
[20:23] from there you can get to the WHOIS for that TLD by [tld].whois-servers.net
[20:23] but then I get stuck
[20:23] the whois utility has a hardcoded list, I think
[20:24] oh
[20:24] https://www.dns-oarc.net/oarc/data/zfr
[20:24] that might work
[20:31] say hi
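
A minimal sketch of the [tld].whois-servers.net lookup described above, speaking the raw RFC 3912 protocol on port 43. The example domain is a placeholder; registries rate-limit these queries and format responses differently, and, as noted in the channel, this gives single-record lookups rather than any kind of database dump:

    #!/usr/bin/env python
    # Sketch: one RFC 3912 WHOIS query, finding the server for a domain's TLD
    # via the [tld].whois-servers.net convention mentioned above.
    import socket

    def whois(domain):
        tld = domain.rsplit(".", 1)[-1]
        server = "%s.whois-servers.net" % tld
        sock = socket.create_connection((server, 43), timeout=30)
        # The protocol is just: send the query plus CRLF, read until close.
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks).decode("utf-8", "replace")

    print(whois("example.org"))
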