#archiveteam 2011-12-16,Fri


Time Nickname Message
00:03 🔗 SketchCow [6:59:58 PM] Kenji Nagahashi (Internet Archive): poe-news.com and poetv.com have been blocking crawler since 14th 11AM UTC.
00:03 🔗 SketchCow [7:00:31 PM] Jason Scott: YOU SHALL BROWSE US, NEVERMORE
00:03 🔗 SketchCow [7:03:17 PM] Jason Scott: THERE I WAS, GENTLY WEBSERVING / WHEN I HEARD A SOUND UNNERVING / THE SOUND OF CRAWLERS MANY AS THEY CAME UPON MY DOOR
00:04 🔗 SketchCow [7:03:50 PM] Jason Scott: WAS THAT SOUND KENJI LAUGHING / AS MY DOORS THEY KEPT ON RAPPING / EVER RAPPING FROM THE CRAWLERS KNOCKING ON MY SERVER'S DOOR
00:04 🔗 SketchCow [7:04:08 PM] Jason Scott: CAME A VOICE THEN: "404."
00:17 🔗 bsmith094 SketchCow: was my poenews targz acceptable
00:17 🔗 bsmith094 the files and warc were inside it
00:18 🔗 SketchCow Haven't looked yet.
06:23 🔗 Wyatt Oh yes, Splinder. The tracker says we're done, but I finally ran the check-dld.sh and found about a thousand. Should I run them over again?
06:35 🔗 Wyatt And the python script that verifies. How should I use/interpret the results from that? Should I run dld-streamer on <(grep "Error in" verify.log |cut -d" " -f4|awk -F"/" '{ print $2":"$6 }')?
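Wyatt's shell pipeline takes each "Error in" line from verify.log, grabs its 4th space-separated field (a path), and joins the path's 2nd and 6th "/"-components with a colon. A Python equivalent is sketched below; the sample log-line shape is an assumption for illustration, not taken from the actual Splinder scripts:

```python
def extract_retry_ids(log_lines):
    """Mimic: grep "Error in" | cut -d" " -f4 | awk -F"/" '{print $2":"$6}'.

    For each line containing "Error in", take the 4th space-separated
    field (assumed to be a path), split it on "/", and join its 2nd and
    6th components with a colon. The exact verify.log format here is a
    guess; adjust the field indices to match the real output.
    """
    out = []
    for line in log_lines:
        if "Error in" not in line:
            continue
        fields = line.split(" ")
        if len(fields) < 4:
            continue
        parts = fields[3].split("/")  # awk is 1-indexed: $2 -> [1], $6 -> [5]
        if len(parts) >= 6:
            out.append(parts[1] + ":" + parts[5])
    return out
```

The result is the same `user:item` list the `<(...)` process substitution feeds to dld-streamer.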
18:19 🔗 yipdw goddamn, OpenSSL has the worst error messages ever
18:20 🔗 yipdw like "error 20 at 0 depth lookup:unable to get local issuer certificate"
18:21 🔗 yipdw a far better error message would indicate which part of the certificate chain verification failed
18:21 🔗 yipdw with human-readable text, i.e. the name of the issuer
18:24 🔗 emijrp #WikiTeam is recruiting, we need help archiving zillion wikis http://groups.google.com/group/wikiteam-discuss/browse_thread/thread/2de4428e60fc64f5
18:27 🔗 kennethre yipdw: ssl is wondeful #lolol
18:27 🔗 kennethre *wonderful
18:28 🔗 kennethre emijrp: I'd love to help out
18:30 🔗 yipdw kennethre: I ran into a really strange problem with OpenSSL verification that led me to this
18:30 🔗 yipdw namely, I've some workers that have AMQP subscriptions that just stopped responding until I kicked them
18:30 🔗 yipdw and then I got a flood of OpenSSL verification errors in their logs
18:30 🔗 yipdw no idea if the problems are related
18:31 🔗 emijrp kennethre: great, read the instructions, and ask me if needed, you can start with a small wiki (just some thousands pages)
18:32 🔗 kennethre ew, urllib2
19:33 🔗 kennethre does anyone… archive dns?
19:34 🔗 kennethre with this whole SOPA thing, would be interesting to have a snapshot of the top n sites' records
19:34 🔗 kennethre jic
20:06 🔗 yipdw kennethre: I'm wondering how you'd get started with such a thing: are you thinking of running e.g. dig on the Alexa Top 500?
20:06 🔗 chronomex it's not hard to get a list of all .com
20:06 🔗 yipdw because it seems to me that the sites most affected aren't going to be the Top 500
20:06 🔗 yipdw it's going to be the small .orgs, .nets, etc
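The "run dig over a domain list" idea floated above can be sketched with the Python stdlib. This is a minimal illustration, not an archival-grade tool: `gethostbyname_ex` only returns A records (a real snapshot would want NS, MX, etc. via dig or a DNS library), and the resolver is injectable so the function can be tested without the network:

```python
import socket

def snapshot_a_records(domains, resolve=socket.gethostbyname_ex):
    """Take a point-in-time snapshot of A records for a list of domains.

    `resolve` defaults to the stdlib resolver (A records only); it is a
    parameter so a stub can be substituted in tests. Returns
    {domain: sorted list of IPs}; domains that fail to resolve map to [].
    """
    snapshot = {}
    for domain in domains:
        try:
            _name, _aliases, addrs = resolve(domain)
            snapshot[domain] = sorted(addrs)
        except OSError:
            snapshot[domain] = []
    return snapshot
```

Feeding it the Alexa Top 500 (or any other list) and dumping the dict to disk with a timestamp would give the "jic" snapshot kennethre describes.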
20:06 🔗 kennethre chronomex: how would you get that list?
20:07 🔗 kennethre yipdw: absolutely
20:07 🔗 yipdw I'm not sure how you'd get that list
20:07 🔗 yipdw efficiently ;P
20:07 🔗 kennethre once you have it, it wouldn't be hard to stick them all in a database
20:07 🔗 yipdw I wonder if there's a way to scrape registries
20:08 🔗 kennethre I wonder if opendns has any open data
20:09 🔗 chronomex yipdw: http://www.verisigninc.com/en_US/products-and-services/domain-name-services/grow-your-domain-name-business/analyze/tld-zone-access/index.xhtml
20:09 🔗 yipdw oh
20:09 🔗 chronomex yeah.
20:09 🔗 chronomex it's that hard.
20:09 🔗 yipdw that's a way, I guess
20:09 🔗 chronomex you need a static ip and some paperwork.
20:09 🔗 chronomex it's really complicated
20:09 🔗 kennethre blah
20:10 🔗 bsmith094 why would you need access?
20:10 🔗 yipdw that's not really complicated, but it's annoying
20:10 🔗 yipdw I don't think "we want to archive DNS in case of SOPA apocalypse" is something they're going to approve
20:10 🔗 chronomex During the term of this Agreement, you may use the data for any legal purpose, not prohibited under Section 4 below.
20:10 🔗 bsmith094 oh
20:11 🔗 bsmith094 it's possible to archive DNS?
20:11 🔗 kennethre of course
20:11 🔗 bsmith094 wouldn't that just be whackamole?
20:11 🔗 kennethre could be fun :)
20:11 🔗 yipdw it's no more whack-a-mole than archiving a live site
20:11 🔗 bsmith094 yeah but how big would it be, gb wise
20:11 🔗 chronomex the tricky thing is reaching all the names underneath the zonefile
20:12 🔗 yipdw many
20:12 🔗 chronomex bsmith094: .com .net is about 7G
20:12 🔗 kennethre depends on how many sites you store
20:12 🔗 yipdw chronomex: where'd you get that figure from?
20:12 🔗 chronomex serp snippet that showed up when I was looking for the zonefile itself
20:12 🔗 yipdw huh, interesting
20:12 🔗 bsmith094 wait, would it just be blah.com >> 159.254.222.1, a bunch of these in text?
20:12 🔗 kennethre www.spyrush.com/tld/ (if it loads)
20:12 🔗 chronomex http://www.pir.org/help/access
20:13 🔗 yipdw bsmith094: something like that; I'd prefer dig-style output
20:13 🔗 yipdw makes it easier to reconstruct records for DNS servers
20:13 🔗 kennethre wonder if you can download the whois database
20:13 🔗 bsmith094 yes, i think somebody here tried
20:13 🔗 yipdw e.g. www.l.google.com 205 IN A 74.125.225.50
20:13 🔗 yipdw etc
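The dig-style format yipdw prefers ("www.l.google.com 205 IN A 74.125.225.50") splits cleanly into name, TTL, class, type, and rdata, which is what makes it easy to reconstruct server records from an archive. A small parser sketch:

```python
from collections import namedtuple

Record = namedtuple("Record", "name ttl rclass rtype rdata")

def parse_dig_line(line):
    """Parse one dig-style answer line into its five parts.

    Expected shape: "<name> <ttl> <class> <type> <rdata...>", e.g.
    "www.l.google.com 205 IN A 74.125.225.50". The rdata may itself
    contain spaces (e.g. MX priority + host), so only the first four
    fields are split off.
    """
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return Record(name, int(ttl), rclass, rtype, rdata)
```

Going the other way, a zone file for replaying archived answers is just these tuples joined back with whitespace.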
20:15 🔗 yipdw well, hmm
20:15 🔗 yipdw the longest domain name you can have is what, 249 characters for a 3-character TLD
20:15 🔗 bsmith094 What's more, it provides archived historic whois database in both parsed and raw format for download as CSV files. I used a partial database download for a company project (SEO related) and the data quality was pretty good.
20:15 🔗 bsmith094 Whois API offers the entire whois database download in major GTLDs (.com, .net, .org, .us, .biz, .mobi, etc)
20:16 🔗 yipdw (249 * |LDH|)!
20:16 🔗 yipdw totally doable given multiple universes
20:16 🔗 bsmith094 sorry for the double post, that was from a search i ran http://www.whoisxmlapi.com/whois-database-download.php
20:16 🔗 chronomex 63+63+63
20:16 🔗 bsmith094 ldh?
20:16 🔗 chronomex yipdw: or quantum internet protocol
20:16 🔗 chronomex bsmith094: letter-digit-hyphen
20:16 🔗 yipdw yeah
20:17 🔗 bsmith094 OMFG that's a fricken huge number?!?!?
20:17 🔗 bsmith094 if i'm reading that notation right
20:17 🔗 yipdw yes, it is a fricken huge number
20:17 🔗 yipdw I wasn't seriously considering exhaustive search
20:17 🔗 chronomex heh
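The back-of-the-envelope above can be made slightly less hand-wavy. Assuming the case-insensitive LDH alphabet (26 letters + 10 digits + hyphen = 37 symbols) and the 63-octet label limit from RFC 1035, a rough upper bound on syntactically valid labels is:

```python
def ldh_label_count(max_len=63, alphabet=37):
    """Rough upper bound on syntactically valid DNS labels.

    Assumes the case-insensitive LDH alphabet (26 letters + 10 digits +
    hyphen = 37 symbols) and the 63-octet label limit from RFC 1035.
    This deliberately ignores the rule that a label may not begin or
    end with a hyphen, so it slightly overcounts.
    """
    return sum(alphabet ** n for n in range(1, max_len + 1))
```

Even a single label's space (37^63, roughly 10^98) dwarfs any exhaustive search, which is why the conversation turns to zone-file access instead.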
20:18 🔗 bsmith094 the complete whois database, according to the link i gave, is 125 million records
20:21 🔗 yipdw well, there's no way to dump a whois database via the whois protocol
20:21 🔗 yipdw according to RFC 3912
20:22 🔗 chronomex yeah, the whois protocol isn't designed to allow you to spam me, for all values of me
20:22 🔗 bsmith094 why not, that seems semi reasonable
20:22 🔗 yipdw because there just isn't a way to do it
20:22 🔗 yipdw I was going to start with a list of TLDs
20:23 🔗 yipdw which (for now) is limited
20:23 🔗 yipdw from there you can get to the WHOIS for that TLD by [tld].whois-servers.net
20:23 🔗 yipdw but then I get stuck
20:23 🔗 chronomex the whois utility has a hardcoded list, I think
20:24 🔗 yipdw oh
20:24 🔗 yipdw https://www.dns-oarc.net/oarc/data/zfr
20:24 🔗 yipdw that might work
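The protocol yipdw cites really is that small: per RFC 3912, a whois request is just the query text plus CRLF sent to TCP port 43, and the server replies then closes the connection. A minimal client sketch using the [tld].whois-servers.net convention mentioned above (the query-building is split out so it can be tested offline; the lookup itself is a live network call):

```python
import socket

def build_whois_query(domain):
    """Per RFC 3912, a whois request is the query text followed by CRLF."""
    return (domain + "\r\n").encode("ascii")

def whois(domain, server=None, port=43, timeout=10):
    """Query a whois server for one domain (makes a network connection).

    If no server is given, derives one via the <tld>.whois-servers.net
    CNAME convention. There is no protocol-level way to enumerate or
    dump the database, so this only answers one record at a time.
    """
    if server is None:
        server = domain.rsplit(".", 1)[-1] + ".whois-servers.net"
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(build_whois_query(domain))
        chunks = []
        while True:  # server closes the connection when the answer ends
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

This one-query-at-a-time shape is exactly why bulk downloads (zone-file access, the DNS-OARC data above) are the only realistic route to a full archive.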
20:31 🔗 idle say hi
