[00:49] SketchCow: i can't download the m4v files of 1up shows in wayback
[00:49] for some reason wayback hates downloading video
[00:49] or at least zdmedia urls suck
[00:59] anyways i may get some full episodes of gma from 2008
[01:30] i'm getting an abc news special called berlin: the last 96 hours
[01:30] aired on nov. 12, 1989
[01:33] also i may be able to get primetime episodes
[01:57] i'm uploading total guitar 2009 cds
[05:20] so i found some old xml of abcnews: https://web.archive.org/web/20070127172052/http://abcnews.go.com/Video/echoXML?
[05:20] sadly i think all links are dead
[05:56] schemer.com is dead
[06:01] also https://archiveteam.org doesn't have a valid ssl certificate
[06:40] so like
[06:40] the lego movie was FUCKING AWESOME
[06:46] so I've heard
[06:46] say more
[06:53] it was just
[06:53] lols the whole time
[06:54] i went drunk with friends tbh
[06:54] but it was fuckin good
[06:55] legos were my childhood so
[06:55] i loved it
[06:57] also lol will ferrell in it
[07:00] does anyone have access to classicnickshows.net forums here?
[07:00] i would like an invitation code please
[07:42] so i got a good question about this page: http://abcnews.go.com/topics/lifestyle/health/avandia.htm
[07:42] it's using an rtsp://start.real.com path
[07:43] does anyone know how to get it?
[07:43] i found it funny because these sorts of paths were used back in 2005
[07:43] so if this more recent one works i may be able to get the older real media files
[09:27] uploaded: https://archive.org/details/Better_Mind_The_Computer_BBC_Documentary
[10:41] good news on the abcnews front
[10:41] some missing episodes may be saved by just grabbing the _embed version
[11:15] sigh, vbox being a PITA again D:
[13:33] i'm off to bed for now
[13:33] i'm uploading the first 4 months of 2008 world news today
[13:33] *tonight
[13:42] Hurrah
[13:57] haha, opened the door to let cold air in
[13:57] followed immediately by an accusing stare from my cat
[16:39] does anybody have a copy of the MediaDefender sources lying around, by any chance
[16:39] the torrents appear to have died...
[16:39] (another reason why torrents are not a suitable archival mechanism)
[16:44] I do
[17:58] I believe SketchCow may be able to be indirectly credited for this http://ah.roosterteeth.com/archive/?id=8733
[18:01] this NEST stuff makes me laugh
[18:01] all these people with these $130 smoke detectors which will stop working the moment google gets bored.
[18:10] SmileyG: nest is one of those things that just makes me go "WHYYYYY"
[18:10] like, how could this possibly be a good/useful idea
[18:14] hello my favorite archivists, how is the weekend so far?
[18:14] https://www.youtube.com/watch?v=d0PtWLPmnTg - short video about the LOIRP guys
[18:15] morten77: slowly approaching the day of my baby's birth, and also the day i end up unemployed.... so exciting and scary in equal amounts.
[18:17] oh
[18:18] I wonder how to archive a website that loads parts of itself with ajax?
[18:19] morten77: painfully :D
[18:20] I'm mostly thinking about slashdot primarily... webpages that don't really need any ajax but use it anyway. there is no way to just request the whole page with all the comments for an article... you only get the first 100 comments
[18:22] the rest have to be fetched with javascript, I can't just do a web.archive.org/record/ on them
[18:24] And I really think now is the time to preserve all the good thoughts that have been written there, before dice destroys Slashdot completely. before it is too late... :(
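(A rough sketch of the brute-force approach to the ajax question above: find the XHR URLs the page fires in your browser's network tab, then feed the article page plus those URLs to the Wayback Machine's save endpoint. The Slashdot URLs below are placeholders, not real endpoints, and the use of web.archive.org/save/ with plain GET requests is an assumption, not something from the conversation.)

```python
# Sketch only: both slashdot URLs are made-up placeholders; you would copy the
# real article URL and the ajax (XHR) URLs out of the browser's network tab.
import requests

article_url = "http://slashdot.org/story/00/00/00/example"        # placeholder
ajax_urls = [
    "http://slashdot.org/example-ajax-comments-call?start=100",   # placeholder
]

for url in [article_url] + ajax_urls:
    # Ask the Wayback Machine to capture each URL, including the requests
    # the page would normally only make from javascript.
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    print(resp.status_code, url)
```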
[18:30] SmileyG: so when is the planned date for the baby to see the light of day?
[18:30] 15th is the "date".
[18:31] I think she'll be here before then.
[18:31] yeah you never know
[18:36] morten77: actually it is possible
[18:36] drop down to discussion system 1
[18:37] which is the idea in https://github.com/ArchiveTeam/slashdot-grab
[18:37] oh.
[18:38] I didn't know there was a project for it. nice!
[18:39] hello my favorite archivists, how is the weekend so far?
[18:39] not great :(
[18:39] joepie91: what is not great?
[18:42] my weekend, heh
[18:42] my day so far has consisted of being frustrated at my shitty internet
[18:42] DFJustin, 94464 images from yandere, after going over original posts $(seq 1 279947) (includes things like https://yande.re/image/5ac8e5d3cfcd328e8b712fe1e58167c7/yande.re%20273254%20partial_scan.png)
[19:46] f*** f*** f*** i deleted a client's primary vm
[19:47] last working backup: 2/02/2014 f****
[19:47] tomorrow will be really fine
[21:14] boo SketchCow, your domain WHOIS broke my WHOIS library :(
[21:15] like, almost completely
[21:15] lol
[21:32] nico, time to look into undelete then?
[21:38] every piece of software i tried failed
[21:38] they don't even see a hint
[21:38] that these 2 files existed
[22:33] I hate it when this happens.
[22:33] http://owely.com/31oMois
[22:35] joepie91: what about it busted your whois library?
[22:38] xmc: apparently my hotfix to work around the Python regex engine getting stuck... broke the .at regex
[22:38] somehow
[22:39] I've removed the hotfix from the NIC contact handle regexes, and added a FIXME
[22:39] worry for later
[22:39] All tests passed!
[22:39] :D
[22:39] monster commit incoming
[22:40] * xmc nods quietly
[22:40] I really like the NAPTR dns record type
[22:40] it's just complex enough to be very interesting
[22:41] xmc: https://github.com/joepie91/python-whois/commit/f2ce1d7b8ab566113a06c84259e35a36024974f8
[22:42] xmc: I'm not sure I want to think about any more parsing tonight
[22:42] lol
[22:42] hahaha
[22:42] seriously
[22:42] regex parsing must have the lowest LOC/COC pay-off versus effort
[22:43] in all of software dev
[22:43] it can literally take 30 minutes to figure out that you need to add one damn character somewhere in a regex
[22:43] to make everything not break
[22:43] lol
[22:43] sounds about right
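(For context, a minimal usage sketch of the pythonwhois library being discussed. The exact dictionary keys in the parsed result vary by version and registry, so treat the field names below as illustrative rather than guaranteed.)

```python
# Minimal sketch: fetch and parse a WHOIS record with pythonwhois.
import pythonwhois

data = pythonwhois.get_whois("ovh.fr")   # performs the lookup and parses it

print(data.get("registrar"))
print(data.get("nameservers"))

# The contact sections are where the per-registry regexes (AFNIC, OVH, ...)
# do their work; any field a regex fails to match simply comes back missing.
for role, contact in (data.get("contacts") or {}).items():
    if contact:
        print(role, contact.get("name"), contact.get("city"), contact.get("postalcode"))
```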
[22:44] xmc: did I mention that there is one particular registrar that has an incredibly bad format
[22:44] point me to an example?
[22:44] also, only one?
[22:44] xmc: https://github.com/joepie91/python-whois/blob/f2ce1d7b8ab566113a06c84259e35a36024974f8/test/data/communigal.net
[22:44] like, what the fuck, I don't even
[22:44] and it gets worse
[22:44] because they also have this
[22:44] https://github.com/joepie91/python-whois/blob/f2ce1d7b8ab566113a06c84259e35a36024974f8/test/data/365calendars.com
[22:45] xmc: that said, AFNIC is pretty fucking awful too
[22:45] lol
[22:45] oh that's awesome
[22:45] who allowed these people on the internet
[22:45] xmc: brace yourself
[22:45] https://github.com/joepie91/python-whois/blob/f2ce1d7b8ab566113a06c84259e35a36024974f8/test/data/ovh.fr
[22:45] AFNIC managed to build a format that is hard to parse for computers AND humans!
[22:45] that's kind of scungy, yeah
[22:45] pretty sure they solved some Hard Problem in software engineering there
[22:45] lol
[22:46] seriously, AFNIC is a pain
[22:46] also because of all the variation
[22:46] https://github.com/joepie91/python-whois/blob/master/pythonwhois/parse.py#L479-L481
[22:46] go figure, they're responsible for 3 out of 4 contact regexes
[22:47] and I've been unable to compress them into one regex without making it -completely- unmaintainable, heh
[22:47] sometimes multiple regexes are what you need
[22:47] yup :(
[22:47] but yeah, AFNIC kept me busy for... 3 hours? or so?
[22:47] I have a phone number formatting system
[22:47] figuring out all their wonky variations
[22:48] ?
[22:48] it consists of a sql table with a bunch of sql globs to match phone numbers
[22:48] and substitution rules to insert and remove digits and punctuation to make it human-recognizable as a phone number
[22:48] hehe
[22:48] it's not even really parsing
[22:48] :P
[22:48] just something pretending to do parsing
[22:48] lol
[22:49] (I've done that a few times)
[22:49] what country are you in, I can give an example format rule
[22:49] actually, what is your country's dialing code
[22:49] :P
[22:49] actually, pythonwhois' normalization isn't too different in that sense... it doesn't actually parse anything, just tries some 'best effort' normalization rules
[22:49] xmc: +31
[22:49] Netherlands
[22:49] and we have about a quantizillion different ways of writing down phone numbers
[22:49] oh boy, that's a good one
[22:49] 1sec
[22:49] there's one "official" way but nobody actually uses that
[22:50] so that tells you how bad it is :P
[22:50] http://www.rafb.me/results/zpIgev70.html
[22:50] * joepie91 is not entirely certain what he's looking at
[22:50] (also, those UUIDs do not at all look like UUIDs...)
[22:51] they are too uuids
[22:51] google_uuid is not, but that's google's fault
[22:51] they don't look random enough
[22:51] in fact
[22:51] it looks incremental
[22:51] they're v1 uuids, which have time and machine id as a component
[22:51] so can it
[22:51] surely that shouldn't have such a large impact on randomness?
[22:51] oh well :P
[22:51] anyway, match_string gets the digits
[22:52] then the underscore positions are distributed as in the other columns
[22:52] okay, and how do you deal with .31, +31, 0031, absence of 31
[22:52] etc
[22:52] I deal with that on ingest, everything is stored as full international but without the +
[22:52] mmm
[22:53] I'm not sure 090_ is valid
[22:53] I think it's 09__ nowadays
[22:53] not sure though
[22:53] orly, maybe I'll update that
[22:53] either way, forget about consistent spacing in NL phone numbers
[22:53] I usually only do it when I see something caught by my catchall rule
[22:53] people usually space them out in such a way that they're easy to remember
[22:53] fuckers
[22:53] why can't they conform to my rules
[22:53] so if there are repeated/related digits, that's how people space them
[22:53] also, xmc, you're doing it wrong with the 01__ etc
[22:54] oh yeah?
[22:54] every Dutch phone number is 10 digits total, with an area/type code of 3 or 4 digits (except for 06 which is for mobile numbers)
[22:54] however
[22:54] 010 - something is not inherently different from 0165 - something
[22:54] it just has 7 local digits instead of 6
[22:54] it doesn't magically lack spacing
[22:54] :P
[22:55] hm, ok
[22:55] the official way is to not have spacing -at all- except for between the area/type code and local digits
[22:55] I usually get spacing rules from all the examples on the local telco regulatory agency's website
[22:55] and the spacing there should be a dash
[22:55] but as I said, nobody does that
[22:55] correct: 06-12345678
[22:55] correct: 010-1234567
[22:55] not examples, I mean, more like the "contact us" pages
[22:55] correct: 0165-123456
[22:55] incorrect: 0165 - 12 34 56
[22:55] (but everybody does something like that)
[22:56] hmm, thanks
[22:56] xmc: I feel like I only made your job harder, by basically saying "there are no rules that anybody really follows"
[22:56] lol
[22:56] that's fine!
[22:56] personally I'd suggest the following for spacing (which is not 'official' but sensible to most people):
[22:56] 06 - 1234 5678
[22:56] 010 - 123 45 67
[22:57] 0165 - 12 34 56
[22:58] cool, I'll note that down for next time I touch that dataset
[23:00] :)
[23:00] thanks for the deets. local knowledge is sadly lacking from that dataset
[23:00] * joepie91 still needs to implement phone number normalization in pythonwhois...
[23:00] xmc: heh, I can imagine
[23:00] local knowledge is tricky to come by really
[23:00] the goal was mostly to get close enough that the google indexer would be able to match typed-in phone numbers
[23:01] hoe
[23:01] hoe?
[23:01] my domain name broke joepie91's whois lib?
[23:01] I might be willing to share my rewrite rules with pythonwhois, contact me when you start working on that
[23:01] * nico is reading the backlog
[23:01] nico: nah
[23:01] my change did
[23:01] I'm not sure how yet
[23:01] :p
[23:01] but it did
[23:02] xmc: be aware that pythonwhois is licensed under the WTFPL, though :P
[23:02] noted
[23:02] my rewrite table is private and not yet licensed to anyone
[23:02] fucking laggy coffeeshop wifi
[23:02] ah, you own the IP yourself?
[23:02] yeah I compile it myself from many sources
[23:02] I somehow assumed it was written for an employer of some sort
[23:02] :P
[23:03] also, new pythonwhois released!
[23:03] nope, hobby project
[23:03] ahh :)
[23:03] I have odd hobbies
[23:05] joepie91: can you try boissonnet-rousseau.com with your lib?
[23:05] it has Whois Server: whois.ovh.com
[23:06] OVH should be supported
[23:06] seems to work fine
[23:06] http://sprunge.us/UjRg
[23:06] hm
[23:06] the city and postal code are mixed up?
[23:06] that's an odd one
[23:06] aix en provence, 13100
[23:07] just a moment
[23:07] let me check another ovh owned domain
[23:07] yes
[23:07] it is an ovh fuckup
[23:07] same thing on 1pacte.net
[23:07] I have an OVH testcase
[23:07] so unless it's a once-off fuckup
[23:07] it's a problem in my testcase
[23:07] lol
[23:08] oh also.
[23:08] postal addresses: don't even try
[23:08] nico, from the testcase: 59053, Roubaix Cedex 1
[23:08] so this seems like an inconsistency within OVH, perhaps?
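(A toy sketch of the spacing scheme joepie91 suggested a few lines up, applied to numbers stored the way xmc describes: full international form without the '+'. The list of 3-digit area codes is deliberately incomplete and real NL numbering has more edge cases; this is an illustration, not a validator.)

```python
# Sketch of the suggested NL spacing, assuming input like "31101234567"
# (country code 31 plus 9 national digits, no '+').

def format_nl(number: str) -> str:
    if not number.startswith("31") or len(number) != 11:
        raise ValueError("expected a Dutch number as '31' + 9 digits")
    national = "0" + number[2:]          # 31101234567 -> 0101234567
    if national.startswith("06"):        # mobile: 06 - 1234 5678
        return f"06 - {national[2:6]} {national[6:]}"
    if national[:3] in ("010", "020", "030", "040", "050", "070"):
        # 3-digit area code (incomplete list, illustration only): 010 - 123 45 67
        area, local = national[:3], national[3:]
        return f"{area} - {local[:3]} {local[3:5]} {local[5:]}"
    # everything else treated as a 4-digit area code: 0165 - 12 34 56
    area, local = national[:4], national[4:]
    return f"{area} - {local[:2]} {local[2:4]} {local[4:]}"

print(format_nl("31612345678"))   # 06 - 1234 5678
print(format_nl("31101234567"))   # 010 - 123 45 67
print(format_nl("31165123456"))   # 0165 - 12 34 56
```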
[23:09] I went to school for geography and we had a few weeks in class where we talked about postal addresses
[23:09] * nico checks the domains of his clients
[23:09] xmc: hehe
[23:09] the best way to do postal addresses is a multiline text field
[23:09] what?
[23:09] don't even try to validate, it's actually impossible without sending a letter in the mail and seeing if it gets there
[23:09] :-)
[23:10] nico: I think it's inconsistent within OVH, not sure how to deal with that
[23:10] littoral-bureautique.com is also city, postal code
[23:10] fucking OVH
[23:10] that's the third time they're giving me a headache
[23:10] xmc: hehe
[23:10] xmc: it is an invalid test
[23:10] sounds like you should not be allowed to levy the Internet Death Penalty, because you'd probably use it everywhere
[23:10] xmc: do postal codes always begin with a number?
[23:10] haha
[23:11] I read a page the other day about database programmers' false assumptions about people's names.. it was quite fun
[23:11] in france you can send mail to a totally wrong address
[23:11] joepie91: no! in canada they are e.g. V6A 2V8
[23:11] lol
[23:11] oh goddamnit
[23:11] and if the name is remotely associated with something laposte knows
[23:11] hahahahaha
[23:11] let me ask this differently
[23:11] does a postal code always contain a number
[23:11] they will override the address
[23:11] :)
[23:11] joepie91: not necessarily
[23:11] * joepie91 braces for "haha no"
[23:11] joepie91: no
[23:11] 13800 cedex 1
[23:11] xmc: okay, so how the fuck do I tell a postal code apart from a city
[23:11] :)
[23:11] "you don't"
[23:11] joepie91: "exercise for the reader"
[23:12] I know, but how do I do it anyway
[23:12] lol
[23:12] know what
[23:12] heuristics time
[23:12] en.wikipedia.org/wiki/List_of_postal_codes
[23:12] if A contains numbers and B doesn't, or A is shorter than B, then A is probably the postal code
[23:12] good enough for this corner case
[23:13] possible.
[23:13] but
[23:13] too bad for people living in 4-letter cities that happen to have services with OVH and have their data turned around
[23:13] lol
[23:13] I guess there can be cities that have numbers in their name.... bad city names
[23:13] also short city names
[23:13] morten77: don't make me think about this please
[23:13] it's past midnight :(
[23:13] xmc: the goal of pythonwhois is 100% format support, best-effort accuracy
[23:14] aye
[23:14] I'd say that my heuristics are probably good enough :P
[23:14] not perfect
[23:14] but good enough
[23:14] yes, the problem is insoluble in the general case
[23:14] I mean, really, how many people live in cities that have less than 5 characters AND have a number in the name
[23:14] city: Koné postal code: 98860
[23:15] nico: doesn't have a number in the city name
[23:15] joepie91: sounds like you need to do a statistical survey of whois entries
[23:15] so it'd declare kone to be the city
[23:15] ah, AND
[23:15] not quite
[23:15] attempt 1: XOR for numbers
[23:15] (that is, if one and only one has numbers, that's the postal)
[23:15] attempt 2: if that fails, the shorter one is the postal code
[23:16] xmc: honestly, there's already a few heuristic hacks like this in the normalization code, and it's working out fairly well
[23:16] it looks like per http://en.wikipedia.org/wiki/List_of_postal_codes all postal code systems have at least one digit
[23:16] AA NN for street addresses, AA AA for P.O. Box addresses. The second half of the postcode identifies the street delivery walk (e.g.: Hamilton HM 12) or the PO Box number range (e.g.: Hamilton HM BX). See Postal codes in Bermuda.
[23:16] fuck you, Bermuda :(
[23:17] oh right, AA AA
[23:17] ick
[23:17] well they can deal
[23:17] might want to have "has a number or is all uppercase"
[23:17] but how many people are using ovh from bermuda anyway
[23:17] xmc: lots of registrars put EVERYTHING IN UPPER CASE
[23:17] so yeah :P
[23:18] but does ovh?
[23:18] caps lock for the win
[23:18] there's a reason that normalization code is there
[23:18] xmc: on some registries, yes
[23:18] icko
[23:18] xmc: ovh is doing everything it can
[23:18] to annoy you
[23:18] multiple times
[23:18] sounds like
[23:19] here, joepie91: http://www.upu.int/en/activities/addressing/postal-addressing-systems-in-member-countries.html
[23:20] * joepie91 doesn't want to click that
[23:20] seriously though
[23:20] "39. People whose names break my system are weird outliers. They should have had solid, acceptable names, like 田中太郎."
[23:20] nico: can you create a ticket on the github repo so that I don't forget
[23:20] ah, the bermuda AA AA is for post office boxes only
[23:20] and I'll implement the postal code heuristics tomorrow
[23:21] preferably with a few test domains :P
[23:23] * joepie91 prods nico
[23:23] mm, need sleep
[23:23] goodnight all :P
[23:25] goodnight joepie91
[23:54] so, fixed one part only to kill the next one..
[23:55] this isn't going to be my night
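(To close out the postal-code thread: a sketch of the two-step heuristic described above, XOR on digits and then fall back to length, with the "all uppercase" idea from the Bermuda tangent added as an extra tiebreaker. The function name and signature are invented for illustration.)

```python
# Best-effort guess at which of two swapped values is the postal code and
# which is the city, per the heuristics discussed in the log above.

def split_postcode_city(a: str, b: str) -> tuple:
    """Return (postal_code, city) as best-effort guesses."""
    has_digit_a = any(ch.isdigit() for ch in a)
    has_digit_b = any(ch.isdigit() for ch in b)

    # attempt 1: XOR for numbers; if exactly one value contains a digit,
    # that one is the postal code
    if has_digit_a != has_digit_b:
        return (a, b) if has_digit_a else (b, a)

    # optional hint: an all-uppercase token (e.g. Bermuda-style "HM BX")
    # is more likely the postal code than the city
    if a.isupper() != b.isupper():
        return (a, b) if a.isupper() else (b, a)

    # attempt 2: if that fails, the shorter one is probably the postal code
    return (a, b) if len(a) <= len(b) else (b, a)

print(split_postcode_city("Aix en Provence", "13100"))  # ('13100', 'Aix en Provence')
print(split_postcode_city("59053", "Roubaix Cedex 1"))  # ('59053', 'Roubaix Cedex 1')
```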