#archiveteam 2011-12-16,Fri


Time Nickname Message
00:03 🔗 SketchCow [6:59:58 PM] Kenji Nagahashi (Internet Archive): poe-news.com and poetv.com have been blocking crawler since 14th 11AM UTC.
00:03 🔗 SketchCow [7:00:31 PM] Jason Scott: YOU SHALL BROWSE US, NEVERMORE
00:03 🔗 SketchCow [7:03:17 PM] Jason Scott: THERE I WAS, GENTLY WEBSERVING / WHEN I HEARD A SOUND UNNERVING / THE SOUND OF CRAWLERS MANY AS THEY CAME UPON MY DOOR
00:04 🔗 SketchCow [7:03:50 PM] Jason Scott: WAS THAT SOUND KENJI LAUGHING / AS MY DOORS THEY KEPT ON RAPPING / EVER RAPPING FROM THE CRAWLERS KNOCKING ON MY SERVER'S DOOR
00:04 🔗 SketchCow [7:04:08 PM] Jason Scott: CAME A VOICE THEN: "404."
00:17 🔗 bsmith094 SketchCow: was my poenews targz acceptable
00:17 🔗 bsmith094 the files and warc were inside it
00:18 🔗 SketchCow Haven't looked yet.
06:23 🔗 Wyatt Oh yes, Splinder. The tracker says we're done, but I finally ran the check-dld.sh and found about a thousand. Should I run them over again?
06:35 🔗 Wyatt And the python script that verifies. How should I use/interpret the results from that? Should I run dld-streamer on <(grep "Error in" verify.log |cut -d" " -f4|awk -F"/" '{ print $2":"$6 }')?
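Wyatt's shell pipeline takes each "Error in" line from verify.log, grabs its 4th space-separated field (a path), and joins the path's 2nd and 6th "/"-components with a colon. A Python equivalent is sketched below; the sample log-line shape is an assumption for illustration, not taken from the actual Splinder scripts:

```python
def extract_retry_ids(log_lines):
    """Mimic: grep "Error in" | cut -d" " -f4 | awk -F"/" '{print $2":"$6}'.

    For each line containing "Error in", take the 4th space-separated
    field (assumed to be a path), split it on "/", and join its 2nd and
    6th components with a colon. The exact verify.log format here is a
    guess; adjust the field indices to match the real output.
    """
    out = []
    for line in log_lines:
        if "Error in" not in line:
            continue
        fields = line.split(" ")
        if len(fields) < 4:
            continue
        parts = fields[3].split("/")  # awk is 1-indexed: $2 -> [1], $6 -> [5]
        if len(parts) >= 6:
            out.append(parts[1] + ":" + parts[5])
    return out
```

The result is the same `user:item` list the `<(...)` process substitution feeds to dld-streamer.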
18:19 🔗 yipdw goddamn, OpenSSL has the worst error messages ever
18:20 🔗 yipdw like "error 20 at 0 depth lookup:unable to get local issuer certificate"
18:21 🔗 yipdw a far better error message would indicate which part of the certificate chain verification failed
18:21 🔗 yipdw with human-readable text, i.e. the name of the issuer
18:24 🔗 emijrp #WikiTeam is recruiting, we need help archiving zillion wikis http://groups.google.com/group/wikiteam-discuss/browse_thread/thread/2de4428e60fc64f5
18:27 🔗 kennethre yipdw: ssl is wondeful #lolol
18:27 🔗 kennethre *wonderful
18:28 🔗 kennethre emijrp: I'd love to help out
18:30 🔗 yipdw kennethre: I ran into a really strange problem with OpenSSL verification that led me to this
18:30 🔗 yipdw namely, I've some workers that have AMQP subscriptions that just stopped responding until I kicked them
18:30 🔗 yipdw and then I got a flood of OpenSSL verification errors in their logs
18:30 🔗 yipdw no idea if the problems are related
18:31 🔗 emijrp kennethre: great, read the instructions, and ask me if needed, you can start with a small wiki (just some thousands pages)
18:32 🔗 kennethre ew, urllib2
19:33 🔗 kennethre does anyone… archive dns?
19:34 🔗 kennethre with this whole SOPA thing, would be interesting to have a snapshot of the top n sites' records
19:34 🔗 kennethre jic
20:06 🔗 yipdw kennethre: I'm wondering how you'd get started with such a thing: are you thinking of running e.g. dig on the Alexa Top 500?
20:06 🔗 chronomex it's not hard to get a list of all .com
20:06 🔗 yipdw because it seems to me that the sites most affected aren't going to be the Top 500
20:06 🔗 yipdw it's going to be the small .orgs, .nets, etc
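The "run dig over a domain list" idea floated above can be sketched with the Python stdlib. This is a minimal illustration, not an archival-grade tool: `gethostbyname_ex` only returns A records (a real snapshot would want NS, MX, etc. via dig or a DNS library), and the resolver is injectable so the function can be tested without the network:

```python
import socket

def snapshot_a_records(domains, resolve=socket.gethostbyname_ex):
    """Take a point-in-time snapshot of A records for a list of domains.

    `resolve` defaults to the stdlib resolver (A records only); it is a
    parameter so a stub can be substituted in tests. Returns
    {domain: sorted list of IPs}; domains that fail to resolve map to [].
    """
    snapshot = {}
    for domain in domains:
        try:
            _name, _aliases, addrs = resolve(domain)
            snapshot[domain] = sorted(addrs)
        except OSError:
            snapshot[domain] = []
    return snapshot
```

Feeding it the Alexa Top 500 (or any other list) and dumping the dict to disk with a timestamp would give the "jic" snapshot kennethre describes.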
20:06 🔗 kennethre chronomex: how would you get that list?
20:07 🔗 kennethre yipdw: absolutely
20:07 🔗 yipdw I'm not sure how you'd get that list
20:07 🔗 yipdw efficiently ;P
20:07 🔗 kennethre once you have it, it wouldn't be hard to stick them all in a database
20:07 🔗 yipdw I wonder if there's a way to scrape registries
20:08 🔗 kennethre I wonder if opendns has any open data
20:09 🔗 chronomex yipdw: http://www.verisigninc.com/en_US/products-and-services/domain-name-services/grow-your-domain-name-business/analyze/tld-zone-access/index.xhtml
20:09 🔗 yipdw oh
20:09 🔗 chronomex yeah.
20:09 🔗 chronomex it's that hard.
20:09 🔗 yipdw that's a way, I guess
20:09 🔗 chronomex you need a static ip and some paperwork.
20:09 🔗 chronomex it's really complicated
20:09 🔗 kennethre blah
20:10 🔗 bsmith094 why would you need access?
20:10 🔗 yipdw that's not really complicated, but it's annoying
20:10 🔗 yipdw I don't think "we want to archive DNS in case of SOPA apocalypse" is something they're going to approve
20:10 🔗 chronomex During the term of this Agreement, you may use the data for any legal purpose, not prohibited under Section 4 below.
20:10 🔗 bsmith094 oh
20:11 🔗 bsmith094 it's possible to archive DNS?
20:11 🔗 kennethre of course
20:11 🔗 bsmith094 wouldn't that just be whackamole?
20:11 🔗 kennethre could be fun :)
20:11 🔗 yipdw it's no more whack-a-mole than archiving a live site
20:11 🔗 bsmith094 yeah but how big would it be, gb wise
20:11 🔗 chronomex the tricky thing is reaching all the names underneath the zonefile
20:12 🔗 yipdw many
20:12 🔗 chronomex bsmith094: .com .net is about 7G
20:12 🔗 kennethre depends on how many sites you store
20:12 🔗 yipdw chronomex: where'd you get that figure from?
20:12 🔗 chronomex serp snippet that showed up when I was looking for the zonefile itself
20:12 🔗 yipdw huh, interesting
20:12 🔗 bsmith094 wait, would it just be blah.com >> 159.254.222.1, a bunch of these in text?
20:12 🔗 kennethre www.spyrush.com/tld/ (if it loads)
20:12 🔗 chronomex http://www.pir.org/help/access
20:13 🔗 yipdw bsmith094: something like that; I'd prefer dig-style output
20:13 🔗 yipdw makes it easier to reconstruct records for DNS servers
20:13 🔗 kennethre wonder if you can download the whois database
20:13 🔗 bsmith094 yes, i think somebody here tried
20:13 🔗 yipdw e.g. www.l.google.com 205 IN A 74.125.225.50
20:13 🔗 yipdw etc
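The dig-style format yipdw prefers ("www.l.google.com 205 IN A 74.125.225.50") splits cleanly into name, TTL, class, type, and rdata, which is what makes it easy to reconstruct server records from an archive. A small parser sketch:

```python
from collections import namedtuple

Record = namedtuple("Record", "name ttl rclass rtype rdata")

def parse_dig_line(line):
    """Parse one dig-style answer line into its five parts.

    Expected shape: "<name> <ttl> <class> <type> <rdata...>", e.g.
    "www.l.google.com 205 IN A 74.125.225.50". The rdata may itself
    contain spaces (e.g. MX priority + host), so only the first four
    fields are split off.
    """
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return Record(name, int(ttl), rclass, rtype, rdata)
```

Going the other way, a zone file for replaying archived answers is just these tuples joined back with whitespace.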
20:15 🔗 yipdw well, hmm
20:15 🔗 yipdw the longest domain name you can have is what, 249 characters for a 3-character TLD
20:15 🔗 bsmith094 What's more, it provides archived historic whois database in both parsed and raw format for download as CSV files. I used a partial database download for a company project (SEO related) and the data quality was pretty good.
20:15 🔗 bsmith094 Whois API offers the entire whois database download in major GTLDs (.com, .net, .org, .us, .biz, .mobi, etc)
20:16 🔗 yipdw (249 * |LDH|)!
20:16 🔗 yipdw totally doable given multiple universes
20:16 🔗 bsmith094 sorry for the double post, that was from a search i ran http://www.whoisxmlapi.com/whois-database-download.php
20:16 🔗 chronomex 63+63+63
20:16 🔗 bsmith094 ldh?
20:16 🔗 chronomex yipdw: or quantum internet protocol
20:16 🔗 chronomex bsmith094: letter-digit-hyphen
20:16 🔗 yipdw yeah
20:17 🔗 bsmith094 OMFG that's a fricken huge number?!?!?
20:17 🔗 bsmith094 if i'm reading that notation right
20:17 🔗 yipdw yes, it is a fricken huge number
20:17 🔗 yipdw I wasn't seriously considering exhaustive search
20:17 🔗 chronomex heh
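The back-of-the-envelope above can be made slightly less hand-wavy. Assuming the case-insensitive LDH alphabet (26 letters + 10 digits + hyphen = 37 symbols) and the 63-octet label limit from RFC 1035, a rough upper bound on syntactically valid labels is:

```python
def ldh_label_count(max_len=63, alphabet=37):
    """Rough upper bound on syntactically valid DNS labels.

    Assumes the case-insensitive LDH alphabet (26 letters + 10 digits +
    hyphen = 37 symbols) and the 63-octet label limit from RFC 1035.
    This deliberately ignores the rule that a label may not begin or
    end with a hyphen, so it slightly overcounts.
    """
    return sum(alphabet ** n for n in range(1, max_len + 1))
```

Even a single label's space (37^63, roughly 10^98) dwarfs any exhaustive search, which is why the conversation turns to zone-file access instead.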
20:18 🔗 bsmith094 the complete whois database, according to the link i gave, is 125 million records
20:21 🔗 yipdw well, there's no way to dump a whois database via the whois protocol
20:21 🔗 yipdw according to RFC 3912
20:22 🔗 chronomex yeah, the whois protocol isn't designed to allow you to spam me, for all values of me
20:22 🔗 bsmith094 why not, that seems semi reasonable
20:22 🔗 yipdw because there just isn't a way to do it
20:22 🔗 yipdw I was going to start with a list of TLDs
20:23 🔗 yipdw which (for now) is limited
20:23 🔗 yipdw from there you can get to the WHOIS for that TLD by [tld].whois-servers.net
20:23 🔗 yipdw but then I get stuck
20:23 🔗 chronomex the whois utility has a hardcoded list, I think
20:24 🔗 yipdw oh
20:24 🔗 yipdw https://www.dns-oarc.net/oarc/data/zfr
20:24 🔗 yipdw that might work
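The protocol yipdw cites really is that small: per RFC 3912, a whois request is just the query text plus CRLF sent to TCP port 43, and the server replies then closes the connection. A minimal client sketch using the [tld].whois-servers.net convention mentioned above (the query-building is split out so it can be tested offline; the lookup itself is a live network call):

```python
import socket

def build_whois_query(domain):
    """Per RFC 3912, a whois request is the query text followed by CRLF."""
    return (domain + "\r\n").encode("ascii")

def whois(domain, server=None, port=43, timeout=10):
    """Query a whois server for one domain (makes a network connection).

    If no server is given, derives one via the <tld>.whois-servers.net
    CNAME convention. There is no protocol-level way to enumerate or
    dump the database, so this only answers one record at a time.
    """
    if server is None:
        server = domain.rsplit(".", 1)[-1] + ".whois-servers.net"
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(build_whois_query(domain))
        chunks = []
        while True:  # server closes the connection when the answer ends
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

This one-query-at-a-time shape is exactly why bulk downloads (zone-file access, the DNS-OARC data above) are the only realistic route to a full archive.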
20:31 🔗 idle say hi
