00:03 <SketchCow> [6:59:58 PM] Kenji Nagahashi (Internet Archive): poe-news.com and poetv.com have been blocking crawler since 14th 11AM UTC.
00:03 <SketchCow> [7:00:31 PM] Jason Scott: YOU SHALL BROWSE US, NEVERMORE
00:03 <SketchCow> [7:03:17 PM] Jason Scott: THERE I WAS, GENTLY WEBSERVING / WHEN I HEARD A SOUND UNNERVING / THE SOUND OF CRAWLERS MANY AS THEY CAME UPON MY DOOR
00:04 <SketchCow> [7:03:50 PM] Jason Scott: WAS THAT SOUND KENJI LAUGHING / AS MY DOORS THEY KEPT ON RAPPING / EVER RAPPING FROM THE CRAWLERS KNOCKING ON MY SERVER'S DOOR
00:04 <SketchCow> [7:04:08 PM] Jason Scott: CAME A VOICE THEN: "404."
00:17 <bsmith094> SketchCow: was my poe-news targz acceptable
00:17 <bsmith094> the files and warc were inside it
00:18 <SketchCow> Haven't looked yet.
06:23 <Wyatt> Oh yes, Splinder. The tracker says we're done, but I finally ran check-dld.sh and found about a thousand. Should I run them over again?
06:35 <Wyatt> And the python script that verifies. How should I use/interpret the results from that? Should I run dld-streamer on <(grep "Error in" verify.log |cut -d" " -f4|awk -F"/" '{ print $2":"$6 }')?
18:19 <yipdw> goddamn, OpenSSL has the worst error messages ever
18:20 <yipdw> like "error 20 at 0 depth lookup:unable to get local issuer certificate"
18:21 <yipdw> a far better error message would indicate which part of the certificate chain verification failed
18:21 <yipdw> with human-readable text, i.e. the name of the issuer
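The improvement yipdw is asking for can be sketched as a small helper. This is a hypothetical example, not OpenSSL's API: the error table mirrors OpenSSL's documented X509_V_ERR codes (20 is "unable to get local issuer certificate"), and `explain_verify_error` names the certificate at the failing depth instead of just printing a number.

```python
# Hypothetical translator for OpenSSL's terse verify errors.
# "error 20 at 0 depth" means: the leaf certificate (depth 0) failed
# with X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY (code 20).
OPENSSL_VERIFY_ERRORS = {
    18: "self-signed certificate",
    19: "self-signed certificate in chain",
    20: "unable to get local issuer certificate",
    21: "unable to verify the first certificate",
}

def explain_verify_error(code, depth, chain_names):
    """Return a message naming the certificate at the failing depth.

    chain_names: subject names of the presented chain, leaf first
    (depth 0 = leaf, depth 1 = its issuer, and so on).
    """
    reason = OPENSSL_VERIFY_ERRORS.get(code, "verify error %d" % code)
    who = chain_names[depth] if depth < len(chain_names) else "depth %d" % depth
    return "%s: %s" % (who, reason)
```

With a chain like `["*.example.org (leaf)", "Example CA"]`, error 20 at depth 0 would render as `*.example.org (leaf): unable to get local issuer certificate` — the same failure, but naming the certificate involved.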
18:24 <emijrp> #WikiTeam is recruiting, we need help archiving zillion wikis http://groups.google.com/group/wikiteam-discuss/browse_thread/thread/2de4428e60fc64f5
18:27 <kennethre> yipdw: ssl is wondeful #lolol
18:27 <kennethre> *wonderful
18:28 <kennethre> emijrp: I'd love to help out
18:30 <yipdw> kennethre: I ran into a really strange problem with OpenSSL verification that led me to this
18:30 <yipdw> namely, I've some workers that have AMQP subscriptions that just stopped responding until I kicked them
18:30 <yipdw> and then I got a flood of OpenSSL verification errors in their logs
18:30 <yipdw> no idea if the problems are related
18:31 <emijrp> kennethre: great, read the instructions, and ask me if needed, you can start with a small wiki (just some thousands pages)
18:32 <kennethre> ew, urllib2
19:33 <kennethre> does anyone… archive dns?
19:34 <kennethre> with this whole SOPA thing, would be interesting to have a snapshot of the top n sites' records
19:34 <kennethre> jic
20:06 <yipdw> kennethre: I'm wondering how you'd get started with such a thing: are you thinking of running e.g. dig on the Alexa Top 500?
20:06 <chronomex> it's not hard to get a list of all .com
20:06 <yipdw> because it seems to me that the sites most affected aren't going to be the Top 500
20:06 <yipdw> it's going to be the small .orgs, .nets, etc
20:07 <kennethre> chronomex: how would you get that list?
20:07 <kennethre> yipdw: absolutely
20:07 <yipdw> I'm not sure how you'd get that list
20:07 <yipdw> efficiently ;P
20:07 <kennethre> once you have it, it wouldn't be hard to stick them all in a database
20:08 <yipdw> I wonder if there's a way to scrape registries
20:09 <kennethre> I wonder if opendns has any open data
20:09 <chronomex> yipdw: http://www.verisigninc.com/en_US/products-and-services/domain-name-services/grow-your-domain-name-business/analyze/tld-zone-access/index.xhtml
20:09 <yipdw> oh
20:09 <chronomex> yeah.
20:09 <chronomex> it's that hard.
20:09 <yipdw> that's a way, I guess
20:09 <chronomex> you need a static ip and some paperwork.
20:09 <chronomex> it's really complicated
20:10 <kennethre> blah
20:10 <bsmith094> why would you need access?
20:10 <yipdw> that's not really complicated, but it's annoying
20:10 <yipdw> I don't think "we want to archive DNS in case of SOPA apocalypse" is something they're going to approve
20:10 <chronomex> "During the term of this Agreement, you may use the data for any legal purpose, not prohibited under Section 4 below."
20:10 <bsmith094> oh
20:11 <bsmith094> it's possible to archive DNS?
20:11 <kennethre> of course
20:11 <bsmith094> wouldn't that just be whack-a-mole?
20:11 <kennethre> could be fun :)
20:11 <yipdw> it's no more whack-a-mole than archiving a live site
20:11 <bsmith094> yeah but how big would it be, GB-wise
20:12 <chronomex> the tricky thing is reaching all the names underneath the zonefile
20:12 <yipdw> many
20:12 <chronomex> bsmith094: .com + .net is about 7G
20:12 <kennethre> depends on how many sites you store
20:12 <yipdw> chronomex: where'd you get that figure from?
20:12 <chronomex> serp snippet that showed up when I was looking for the zonefile itself
20:12 <yipdw> huh, interesting
20:12 <bsmith094> wait, would it just be blah.com >> 159.254.222.1, a bunch of these in text?
20:12 <kennethre> www.spyrush.com/tld/ (if it loads)
20:13 <chronomex> http://www.pir.org/help/access
20:13 <yipdw> bsmith094: something like that; I'd prefer dig-style output
20:13 <yipdw> makes it easier to reconstruct records for DNS servers
20:13 <kennethre> wonder if you can download the whois database
20:13 <bsmith094> yes, i think somebody here tried
20:13 <yipdw> e.g. www.l.google.com 205 IN A 74.125.225.50
20:13 <yipdw> etc
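The dig-style output yipdw prefers is DNS presentation format — the same shape a zone file uses, which is exactly why it is easy to reload into a DNS server later. A minimal sketch of a snapshot writer (the function names here are ours, for illustration):

```python
# Sketch of a dig-style DNS snapshot writer: one presentation-format
# line per resource record, matching yipdw's example
# "www.l.google.com 205 IN A 74.125.225.50".
def format_record(name, ttl, rclass, rtype, value):
    """Render one resource record in presentation format."""
    return "%s %d %s %s %s" % (name, ttl, rclass, rtype, value)

def snapshot(records):
    """records: iterable of (name, ttl, class, type, value) tuples."""
    return "\n".join(format_record(*r) for r in records)
```

Feeding it resolver output (however gathered — dig, a stub resolver library, etc.) yields a flat text archive that doubles as loadable zone data.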
20:15 <yipdw> well, hmm
20:15 <yipdw> the longest domain name you can have is what, 249 characters for a 3-character TLD
20:15 <bsmith094> What's more, it provides archived historic whois database in both parsed and raw format for download as CSV files. I used a partial database download for a company project (SEO related) and the data quality was pretty good.
20:15 <bsmith094> up vote 4 down vote Whois API offers the entire whois database download in major GTLDs (.com, .net, .org, .us, .biz, .mobi, etc)
20:15 <bsmith094> up vote 4 down vote Whois API offers the entire whois database download in major GTLDs (.com, .net, .org, .us, .biz, .mobi, etc)
20:15 <bsmith094> What's more, it provides archived historic whois database in both parsed and raw format for download as CSV files. I used a partial database download for a company project (SEO related) and the data quality was pretty good.
20:16 <yipdw> (249 * |LDH|)!
20:16 <yipdw> totally doable given multiple universes
20:16 <bsmith094> sorry for the double post, that was from a search i ran http://www.whoisxmlapi.com/whois-database-download.php
20:16 <chronomex> 63+63+63
20:16 <bsmith094> ldh?
20:16 <chronomex> yipdw: or quantum internet protocol
20:16 <chronomex> bsmith094: letter-digit-hyphen
20:16 <yipdw> yeah
20:17 <bsmith094> OMFG that's a fricken huge number?!?!?
20:17 <bsmith094> if i'm reading that notation right
20:17 <yipdw> yes, it is a fricken huge number
20:17 <yipdw> I wasn't seriously considering exhaustive search
20:18 <chronomex> heh
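The back-of-envelope above can be made concrete. The LDH ("letter-digit-hyphen") alphabet has 26 + 10 + 1 = 37 symbols and a single DNS label is capped at 63 characters, so counting even the single-label names already shows why exhaustive enumeration is a joke (a sketch of the arithmetic, not anyone's actual plan):

```python
# Why "totally doable given multiple universes": count candidate names
# over the 37-symbol LDH alphabet (a-z, 0-9, '-'), one label, up to the
# 63-character label limit.
LDH = 26 + 10 + 1  # letter-digit-hyphen alphabet size

def names_up_to(length, alphabet=LDH):
    """Number of strings of length 1..length over the alphabet."""
    return sum(alphabet ** n for n in range(1, length + 1))

# names_up_to(63) is on the order of 10**98 -- far beyond any crawl,
# and that ignores multi-label names and LDH placement rules.
```

(The real count is a bit smaller — labels can't start or end with a hyphen — but that doesn't rescue brute force.)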
20:18
π
|
bsmith094 |
the complete whois database according to the link, i gae, is 125million records |
20:21
π
|
yipdw |
well, there's no way to dump a whois database via the whois protocol |
20:21
π
|
yipdw |
according to RFC 3912 |
20:22
π
|
chronomex |
yeah, the whois protocol isn't designed to allow you to spam me, for all values of me |
20:22
π
|
bsmith094 |
why not, that seem semi reasonable |
20:22
π
|
yipdw |
because there just isn't a way to do it |
20:22
π
|
yipdw |
I was going to start at a list of TLDs |
20:23
π
|
yipdw |
which (for now) is limited |
20:23
π
|
yipdw |
from there you can get to the WHOIS for that TLD by [tld].whois-servers.net |
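The lookup path yipdw describes is simple to sketch: an RFC 3912 query is just the object name plus CRLF sent over TCP port 43, and [tld].whois-servers.net is the conventional per-TLD alias. The helper names below are ours, and the network call itself is an untested sketch:

```python
# Minimal RFC 3912 whois sketch: derive the conventional per-TLD server
# name, build the query bytes, and (if you have network access) send
# them over TCP port 43 and read until the server closes the connection.
import socket

WHOIS_PORT = 43

def whois_server_for(domain):
    """Conventional whois server alias for the domain's TLD."""
    tld = domain.rsplit(".", 1)[-1]
    return "%s.whois-servers.net" % tld

def whois_query_bytes(domain):
    """RFC 3912 request: the object name followed by CRLF."""
    return (domain + "\r\n").encode("ascii")

def whois(domain):  # network call -- untested sketch
    with socket.create_connection((whois_server_for(domain), WHOIS_PORT)) as s:
        s.sendall(whois_query_bytes(domain))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")
```

Note this answers one name at a time, which is exactly yipdw's point: the protocol gives you no enumeration or dump operation, so a server list doesn't get you a database.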
20:23 <yipdw> but then I get stuck
20:24 <chronomex> the whois utility has a hardcoded list, I think
20:24 <yipdw> oh
20:24 <yipdw> https://www.dns-oarc.net/oarc/data/zfr
20:24 <yipdw> that might work
20:31 <idle> say hi