[00:14] joepie91: he's not been in since i returned about a week ago
[00:14] so i'm archiving techcrunch by month again
[00:14] mostly cause my wifi sucks
[00:15] it is doing better at the moment
[00:23] :/
[03:03] so 2010 episodes of HD Nation maybe going up soon
[03:28] Good.
[04:32] g4tv.com-video43055: Sasha Grey Behind-the-Scenes Interview: https://archive.org/details/g4tv.com-video43055
[04:32] there is also a flvhd version of that interview
[04:33] g4 didn't have hd broadcast until april 2010 but
[04:33] before that there were behind-the-scenes interviews that were in hd
[06:32] ping
[06:33] dingaling
[06:34] ok
[06:35] just testing if i need to restart pidgin
[08:31] I just had a really, really stupid idea
[08:46] Is it possible to get a list of every file that textfiles.com links to? By that I mean a list of URLs for every publicly viewable textfile on the site from the directory.
[09:03] https://ia700608.us.archive.org/4/items/textfiles-dot-com-2011/MANIFEST.txt
[09:04] while that is what I'm looking for, ivan`, I mainly wanted to focus on the textfiles... although I realise how hard that might be
[09:06] I don't get it, do you want a subset of that manifest?
[09:08] What I'm looking for is, from http://textfiles.com/directory.html, a list of textfiles and/or links to said textfiles viewable from said page
[09:08] so every textfile from http://textfiles.com/100/, http://textfiles.com/adventure/ and so on
[09:19] ffs
[09:19] life takes up so much time sometimes ¬_¬
[09:19] pregnant wife + police statements + work == no time for smiley
[09:21] you have a pregnant police work?
[09:22] hehe
[09:22] no, my wife is 7~ weeks pregnant and extremely tired all the time
[09:23] poor smiley
[09:26] also SmileyG do you know of any way I could get a list like I just described a few lines ago, or am I being farfetched :P
[09:27] so my wifi is working fine right now
[09:27] BlueMax: I'd advise asking SketchCow nicely
[09:27] godane: \o/
[09:27] very well then, I shall ask SketchCow
[09:28] i'm close to getting all public 58xxx-flvhd files uploaded
[09:40] dunno if SketchCow would like a PM or if he can just read everything here
[10:04] i hope you guys have this ftp: http://mirrors.apple2.org.za/
[10:05] it has tons of apple 2 stuff from other ftp sites
[10:34] i had to add 2 files here: http://archive.org/details/g4tv.com-video58984-flvhd
[10:34] cause both are for the 58984 video key
[10:35] the tr_sing file is the public one on the website
[10:35] whereas the other one was in the video-clip-by-id xml dump i did
[10:43] BlueMax: wouldn't it largely just be parsing the paths from that manifest?
[10:44] Coderjoe, I have no idea how to do that to be honest :P
[12:47] hmm
[12:53] got my silly idea working, now to actually get the URL list
[12:55] which I have no idea how to get. bugger.
[12:55] What was the idea?
[12:55] (server with ZNC on it died for once)
[12:56] alright, this is gonna sound really stupid but I've had stupid ideas before and they've turned out well
[12:56] I made a silly little set of "microgames", one of which is selected at random
[12:56] well, I only have one, but I plan to add more
[12:57] and if you manage to successfully finish it, you get a random textfile from textfiles.com as a reward
[12:58] That sounds rather interesting.
[12:58] Let me guess, the URL list would be a list of text files?
[12:58] I call it "textfiles the videogame".
[12:58] Yes GLaDOS, just a big list of all the readable textfiles you can get to from the "directory"
[12:58] and the game just picks one at random
[12:58] and then it opens it in a new tab and resets the game so you can do it again if you wish.
[12:59] (At least, that's the plan)
[13:00] My mind jumped back to SketchCow saying "gamify it" on the subject of the tracker on the Warrior
[13:00] and I thought "I wonder if I could do something with that and Textfiles.com"
[13:00] and this is what I came up with
[13:01] So yeah, I want a list of URLs of the readable textfiles on textfiles.com to make this happen, can probably do the rest myself.
[13:01] Well, all you'd need to do to be able to compile a list of files would be to scrape the directory for any URLs that end in .txt
[13:02] HAHA DISREGARD THAT I SUCK COCKS
[13:02] Scrape for any file that doesn't start with
[13:03] I was about to say something along the lines of that not working
[13:03] But I'm not sure how to scrape for something like that
[13:03] I'm a lowly Windows peasant
[13:03] Ah, right.
[13:04] cygwin
[13:04] i could write a batch file to do it, but that'd only work if all the files were local. so that's not really very helpful at all.. erm.. I'll shut up now.
[13:04] lol antomatic
[13:04] ivan`, I have heard SketchCow personally slam cygwin for being... in fact I don't remember what he said but nevertheless!
[13:05] I suppose I could head up the batchfile with 'wget textfiles.com' to make all the files local. :)
[13:05] BlueMax: I use it every day, it works for most things I need
[13:06] antomatic, I was planning on hosting this on my own server (as this is an HTML5 game) so I wanna try and avoid needing to upload too much data
[13:06] ivan`, lol, to be perfectly honest I don't even know what it is
[13:06] it gives you what you need, grep
[13:06] and a lot of other utilities
[13:08] so it's a Linux command line replacement?
[13:08] yes
[13:09] I see
[13:10] there's supposed to be a way to get a random result from google.. don't know if you could get it to feed one from 'site:textfiles.com filetype:txt' or similar
[13:11] Hm, that only brings up specific .txt files
[13:11] bleh. disregard.
[13:14] HAHA DISREGARD THAT I SUCK COCKS
[13:15] Not sure I have time, I was too busy disregarding GLaDOS sucking cocks. :)
[13:15] jk
[13:15] Hmm
[13:17] GLaDOS, any ideas?
[13:17] None so far.
[13:19] I mean for getting a URL list
[13:19] Oh
[13:19] I still think that crawling would be the best way.
[13:20] I don't really know how. :P How would I go about doing that
[13:21] Don't ask me, I suck at it!
[13:21] You suck a lot of things apparently. :P
[13:21] hue
[13:22] would it be as simple as just grabbing the major indexes from http://web.textfiles.com and jamming them all together?
[13:23] What do you mean, antomatic
[13:25] I mean, if you want a list of all text files on web.textfiles.com, theoretically that would just be every file listed on the pages at web.textfiles.com/computers/ , web.textfiles.com/hacking/ , web.textfiles.com/humor/ , etc
[13:25] web.textfiles.com/filestats.html - 7418 files in total
[13:26] yeah, that's pretty much it, except just as full URLs to all of them.
[13:28] seems doable
[13:28] don't forget textfiles.com in general :P
[13:29] not just web.textfiles.com
[13:29] ah.
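A minimal sketch of the manifest route Coderjoe and ivan` point at: pull MANIFEST.txt from the textfiles-dot-com-2011 item and turn each path into a live URL. It assumes the mirrored paths match the live site layout and that skipping index pages and images is good enough to isolate the readable textfiles; the skip-list, function name, and URL prefix are illustrative guesses, not anything confirmed in the channel.

```python
# Hedged sketch: build a textfiles.com URL list from the IA item manifest.
# Assumes the manifest paths start with the site name and mirror the live
# layout; the list of extensions to skip is a guess.
import requests

MANIFEST_URL = ("https://ia700608.us.archive.org/4/items/"
                "textfiles-dot-com-2011/MANIFEST.txt")
SKIP_SUFFIXES = ('.html', '.htm', '.gif', '.jpg', '.png', '.zip', '.tar')

def textfile_urls():
    manifest = requests.get(MANIFEST_URL).text
    for line in manifest.splitlines():
        path = line.strip()
        if not path or path.lower().endswith(SKIP_SUFFIXES):
            continue
        # paths in the manifest reportedly look like
        # "textfiles.com/100/whatever.txt"
        yield 'http://' + path

if __name__ == '__main__':
    for url in textfile_urls():
        print(url)
```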
[13:34] you can see a silly little test of the "game" at http://bluemaxima.org/thegame
[13:35] it's set to open one textfile by default and it only has one minigame
[13:36] but you get the general idea
[13:45] should I wait to get SketchCow's opinion on this?
[13:46] * SmileyG looks
[13:46] that's errrm, random?
[13:46] type a swear word to win?
[13:48] I thought it fit the theme
[13:48] There will be a few different games
[13:48] Just made that one to make an easy test
[13:49] k
[13:49] so yeah
[13:50] I'd like to get the collection of textfiles in there first before I continue
[14:06] that looks like game maker
[14:06] :P
[14:11] joepie91, that IS game maker
[14:11] :P
[14:12] i've uploaded up to episode 38 of HD Nation
[14:12] heh
[14:13] it works, so why not right :P
[14:16] morning/afternoon
[14:19] BlueMax: if you need to download textfiles.com you don't need to wget -r it
[14:19] winr4r, I don't want to download it, I just want a list of URLs
[14:19] BlueMax: oh, my bad
[14:19] well if the backups of it on archive.org have the same directory structure
[14:20] then find . | sed s/'^\.'/'http:\/\/www.textfiles.com'/
[14:21] I'm not on Linux
[14:21] oh
[14:37] also!
[14:37] is there a reason that the links on the front page of the archive team wiki are absolute URL links
[14:37] under "Archive Team News"
[14:38] rather than [[links]] to the pages
[14:40] (reason being: if you go to archiveteam.org, log in, then go to some of the links on the front page, they point to www.archiveteam.org and so you won't be logged in anymore)
[14:47] winr4r: because when they were set up they weren't done properly :D
[14:54] SmileyG: it could be for a reason though, like for machine-readability purposes
[14:54] which is why i'm not suggesting that anyone fix it!
[15:18] it's so cute when webdevs try to obfuscate/encode their output
[15:18] to prevent scrapers
[15:18] or even corrupt it
[15:18] just makes it more fun for me to try and break it :D
[15:18] (see also my last tweet)
[15:20] you may be assuming they're obfuscating it to make life difficult
[15:21] rather than to make it smaller
[15:21] they're certainly obfuscating to make life difficult
[15:21] I'm talking things like javascript packing
[15:22] (a bunch of streaming sites do this to hide the video URL)
[15:22] oh, gotcha
[15:22] or omitting closing tags
[15:22] (binsearch.info does this)
[15:22] they seem to actually expect that to stop scrapers lol
[15:22] result: I wrote a Javascript unpacker in Python that can deal with any standard packed javascript blob
[15:23] and just adjusted my regexes for binsearch
[15:23] (not using an XML parser because slow)
[15:23] excellent
[15:23] actually
[15:23] in case anyone ever needs to deobfuscate javascript
[15:24] https://github.com/joepie91/resolv/blob/develop/resolv/shared.py#L108
[15:24] you're welcome :)
[15:25] you're awesome!
[15:26] sidenote: doesn't actually use JS
[15:26] so you don't need a JS interpreter
[15:26] I just reverse engineered the decoding thing
[15:26] (not too hard with JS)
[15:27] yup, back up
[15:27] VPS not yet, but host node is
[15:27] whoops
[15:27] wrong channel
[15:36] joepie91: parsing javascript with regexes?
[15:48] joepie91, whenever someone says xml parsers are too slow, I know the person saying it is an idiot, you are now an idiot
[15:48] winr4r: not exactly, just extracting the relevant bits
[15:48] omf_: wat
[15:50] joepie91> (not using an XML parser because slow)
[15:50] yes, what about it?
[15:50] if you're extracting a few tiny bits of info from a relatively complex page, then regular expressions are certainly faster than an XML parser
[15:51] lol cat caught a mouse
[15:52] using regex to parse html or xml in any instance is plain dumb because as veteran programmers know you cannot assume the result is going to be well formed or not change and regex are brittle when dealing with those issues. I used to do it like that 10 years ago before I learned there is a better way.
[15:52] omf_, I am well aware of these issues
[15:52] what you are forgetting however
[15:52] is to take into account the context
[15:52] omf_: "you cannot assume the result is going to be well formed"
[15:52] yes, good luck getting a fucking XML parser to read some random HTML page then
[15:53] considering the kind of sites I'm scraping, the general page format is unlikely to change due to any reason that is not 'intentionally try to break the scraper'
[15:53] in which case, especially considering one of the sites INTENTIONALLY corrupts the HTML to accomplish this
[15:53] this is faster/easier to work around when using regexes
[15:53] than when using a (potentially VERY slow) parser
[15:54] i'd use beautiful soup anytime
[15:54] I am well aware of the reasons for using an XML parser, but those reasons simply do not apply in this particular case
[15:54] winr4r, not a problem, good xml parsers are designed to handle broken shit, this was a solved problem years ago
[15:54] but then i've found pages that break when you parse them with beautiful soup
[15:54] winr4r: beautifulsoup is awesome, but incredibly slow compared to XML parsers that assume proper markup
[15:54] fortunately, i fixed that with a regex!
[15:55] and really
[15:55] joepie91, whenever someone says xml parsers are too slow, I know the person saying it is an idiot, you are now an idiot
[15:55] you probably shouldn't be making such contextless assumptions
[15:56] i'm somewhat in agreement
[15:57] dude I have read your code plenty of times, I have never been impressed
[15:58] using regexes to parse HTML is a bad idea, if you can count on the HTML being consistent, well-structured, and not actually designed to break scrapers
[15:58] regex are the most abused feature in modern programming because everyone thinks they are a solution
[15:58] (bonus points: "xml parsers are too slow" in itself doesn't imply not using them or using regexes instead - I'm not quite sure how your reasoning for considering someone an idiot when saying that, in any way correlates with the statement you're judging on)
[15:59] omf_: it's not exactly my goal to make you impressed with my code
[15:59] I have 135gb of google video stuff from 2011, can I ul it somewhere?
[15:59] my goal is to write software that is as reliable as feasible
[15:59] and usable
[15:59] in this particular case, that means using regex
[16:00] balrog: mm, upload in what sense?
[16:00] as in, eventually throw it on IA, or otherwise?
[16:00] that is your opinion that regex is the best solution
[16:00] yes
[16:00] also, have any of you actually tested how fast parsing html with a real parser is?
[16:00] using regexes will break easily when something changes
[16:00] I've benchmarked using lxml vs using beautifulsoup vs using regex on similar kinds of data extraction before
[16:00] balrog: ahem, so will an XML parser
[16:01] balrog: that only applies when the site is likely to change with non-malicious intentions
[16:01] bs4 supports several parsers
[16:01] sure it'll give you a document tree, but *that document tree has no meaning*
[16:01] that is not the case here
[16:01] yes, I mean the internal thing it always used
[16:01] point is, _if_ the site changes here
[16:01] it's with the purpose of intentionally breaking scrapers
[16:02] so there's really no point to using an XML parser over a regex, because they'll just break it in a different way
[16:02] might be easier to fix with a parser
[16:02] bs4 supports four parsers
[16:02] some are slow and some are fast
[16:02] as for google video btw, you can throw it on the storage box
[16:02] yes, I know
[16:02] those that can handle broken HTML are slow
[16:02] I need to deal with broken HTML
[16:03] yes, because broken html is a pain
[16:03] balrog: not if you use a regex! ;)
[16:03] joepie91, you make a lot of assumptions. "so there's really no point to using an XML parser over a regex, because they'll just break it in a different way"
[16:03] supposedly html.parser has gotten better at broken html since python 2.7.3 / 3.2
[16:03] balrog: I need to assume 2.6
[16:03] bleh.
[16:03] omf_, have you actually looked at what I'm scraping?
[16:04] yes
[16:04] have you actually looked at what the page source looks like?
[16:04] and are you actually aware of their history of trying to break scrapers?
[16:04] because I have, and I am
[16:04] and that is what I have based my design decisions on
[16:07] look, if you want to complain about someone that uses regex as hammer where everything is a nail, then go ahead
[16:07] but then you shouldn't be directing it to me
[16:08] I do actually use HTML/XML parsers when it is feasible and sensible to do so
[16:08] (or other format-specific parsers)
[16:08] i'm with joepie91 here
[16:08] but the "regex is always the wrong solution for HTML" attitude is just as nonsensical as "regex is always the right solution for HTML" attitude
[16:09] just on the other end of the scale
[16:11] you missed the point I made earlier and chose not to address it. Check your blindspot
[16:11] I'm not aware of any missed points.
[16:11] could you quote it?
[16:12] no, use the scrollback, actually read what was said
[16:12] ...
[16:12] omf_
[16:12] I have read every single line you said
[16:12] as I always do
[16:13] if I haven't responded to one, then I'm going to miss it again if I read back
[16:13] because I am under the impression that I did
[16:13] hence asking you to quote it
[16:17] So, I uploaded about 300 "Architecture Books" from a collection.
[16:17] Now, as we all know, the curatorial styles and abilities of some of these online curators can be... variant.
[16:17] This one stands alone.
[16:19] Their definition of "Architecture" included building design, building MATERIALS, house and home magazines, computer programming design, chip and systems design, and brochures for homes.
[16:19] that sounds automated, SketchCow
[16:20] especially considering the programming and chip/systems design
[16:20] as if someone just grepped their library for everything with 'architecture' in the name
[16:23] materials can be used to make design decisions
[16:23] 1. strength
[16:23] 2. looks
[16:23] granite or plastic worktops.
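For readers following the argument above, a small self-contained illustration of the trade-off being debated: pulling one value out of markup with a regex versus with lxml. The snippet and the /nzb/<id> pattern are invented for the example and are not binsearch's actual markup.

```python
# Toy comparison of regex extraction vs a real HTML parser.
# The HTML fragment below is made up purely for illustration.
import re
import lxml.html

HTML = '<table><tr><td><a href="/nzb/12345">some post</a></td></tr></table>'

# Regex route: fast and tolerant of deliberately mangled markup, but tied to
# the exact textual shape of the attribute it matches.
nzb_ids_re = re.findall(r'href="/nzb/(\d+)"', HTML)

# Parser route: survives cosmetic reshuffling of the page, but assumes the
# markup parses into a tree that still means what you think it means.
doc = lxml.html.fromstring(HTML)
nzb_ids_xpath = [href.rsplit('/', 1)[1] for href in doc.xpath('//a/@href')]

assert nzb_ids_re == nzb_ids_xpath == ['12345']
```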
[16:24] SketchCow: splendid
[16:25] So I can understand why it's there (though worktops are a terrible example, that'd come under interior design. How about copper used as roofing material for the fact it changes colour over time?)
[16:26] SmileyG, I got a lot of copper roof homes near where I live
[16:27] i'm currently covered in fence protecting solution, I should shower
[16:27] but first... A PHOTO!
[16:29] yes
[16:29] is it that new neverwet
[16:31] hm
[16:31] i've only once seen a copper-roofed building
[16:31] newbury park tube station in london, it's rather splendid
[16:38] joepie91: "MySQLdb does _not_ actually do parameterized queries, it just pretends to!" <- wat?
[16:39] heh
[16:39] yes
[16:39] dig into the code
[16:39] you'll notice that it actually just takes the values
[16:39] escapes them
[16:39] and string-concats them
[16:39] it's stupid
[16:39] joepie91: Any experience with SSL certificates?
[16:39] joepie91: btw hi :)
[16:39] it just does the parameterized thing because that's what DB-API says it should do :P
[16:39] ohai norbert79
[16:39] not really, I barely use SSL
[16:40] damn... I need to have a proper SSL certificate for my host, Godaddy.com offers one for $6 a year, but I wonder if anyone has technical experience with them
[16:40] anyway, winr4r, if you want to use MySQL in Python, just install oursql
[16:40] pip install oursql (you'll need compile tools and libmysql-devel)
[16:40] because that price smells... I mean it's too cheap to be proper
[16:40] it uses a proper driver
[16:41] and has all the fancy new features including param queries
[16:41] norbert79: stay away from godaddy
[16:41] regardless of what they offer
[16:41] joepie91: oh, thanks :)
[16:41] yeah, fuck godaddy
[16:41] aside from that, I can't see how $6 is "too cheap to be proper"
[16:41] it's really not much more than a promise
[16:41] there is no real per-unit cost for an SSL cert
[16:41] winr4r: No, it sounds good, but I wonder if they offer the signatures from root, or if they use sub-authorities
[16:42] joepie91: Actually I just wish to cover my https pages with one valid and useful cert
[16:42] joepie91: Right now I gave FreeSSL a try, looks ok so far, but they use subauthorities
[16:42] startcom is the only usable free SSL provider that I am aware
[16:42] joepie91: Sent a PM
[16:42] of
[16:43] freeSSL is the demo of RapidSSL
[16:43] for a month
[16:43] $6 sounds good, but I think there is a catch
[16:43] It's just a feeling
[16:44] it's godaddy, so yes, there'll probably be a catch buried somewhere
[16:45] thanks for mentioning godaddy winr4r, it just reminded me I have to cancel that auction shit they signed me up for automatically
[16:45] that was the catch for me last time
[16:46] heh
[16:47] I am now all namecheap
[16:48] ok, so a domain name and SSL cert costs 10.42 EUR, ooor...
[16:49] $13.5
[16:55] hmm, looks like this is the catch: http://i.imgur.com/ZupVnbU.jpg
[16:55] it's not trusted by a main organization
[16:55] but some sub-organization
[16:56] If I understand this well
[16:59] things that requests should implement:
[17:00] 1. file downloading in chunks with one function
[17:00] 2. binding a request to an IP
[17:00] ?
[17:01] python-requests
[17:01] the library
[17:01] and binding to an IP as in, you have multiple IPs
[17:01] Ah, ok, sorry, not related to my issue then
[17:01] and you want a request to originate from a specific one
[17:01] not related
[17:01] just remarking
[17:01] (currently monkeypatching support for this into requests...)
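A rough sketch of the oursql suggestion above: real server-side parameterization instead of MySQLdb's escape-and-concatenate emulation. The host, user, database, and table here are placeholders, and the qmark-style ? placeholder is how oursql takes parameters as far as I know; treat the details as an assumption rather than a verified recipe.

```python
# Hedged sketch of parameterized queries with oursql; connection details and
# the table/columns are placeholders, not anything from the channel.
import oursql

conn = oursql.connect(host='localhost', user='archiver',
                      passwd='secret', db='scrapes')
cursor = conn.cursor()

# qmark-style placeholder: the value is bound and sent alongside the query
# rather than being escaped and concatenated into the SQL string.
cursor.execute('SELECT title, url FROM items WHERE added > ?',
               ('2013-01-01',))
for title, url in cursor.fetchall():
    print('%s %s' % (title, url))

cursor.close()
conn.close()
```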
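And a hedged sketch of the two wishlist items just mentioned, using stock python-requests features rather than whatever monkeypatch is being written in channel: streaming a download to disk in chunks, and pinning outgoing connections to one local IP via a custom transport adapter. The class and function names and the example IP are made up for illustration.

```python
# Sketch: chunked downloads and source-IP binding with python-requests.
import requests
from requests.adapters import HTTPAdapter


class SourceAddressAdapter(HTTPAdapter):
    """Bind every connection made through this adapter to a local address."""

    def __init__(self, source_ip, **kwargs):
        self.source_ip = source_ip
        super(SourceAddressAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        # urllib3 passes source_address down to the underlying socket.
        kwargs['source_address'] = (self.source_ip, 0)
        super(SourceAddressAdapter, self).init_poolmanager(
            connections, maxsize, block=block, **kwargs)


def download(session, url, path, chunk_size=64 * 1024):
    """Stream a response to disk without loading it all into memory."""
    response = session.get(url, stream=True)
    response.raise_for_status()
    with open(path, 'wb') as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip keep-alive chunks
                handle.write(chunk)


session = requests.Session()
session.mount('http://', SourceAddressAdapter('192.0.2.10'))  # example IP
download(session, 'http://example.com/big.nzb', 'big.nzb')
```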
[17:05] yay, I think I got it working
[18:06] awesome.
[18:06] I think they've blocked my scraper already
[18:06] :<
[18:06] _somehow_
[18:06] mysteriously getting HTTP 400s even on previously working NZBs
[18:06] works fine via browser
[18:07] joepie91: user agent?
[18:10] wait, maybe not
[18:10] it seems all POST requests fail
[18:10] winr4r: nah, it randomly selects a legitimate useragent from a list
[18:10] changing it didn't fix it
[18:10] but one sec
[18:11] okay, wow
[18:11] I'm such an incredible idiot
[18:11] lol
[18:11] I was doing a get request in my post request wrapper
[18:11] excellent
[20:35] on the subject of BlueMax's game project: you can get a full list of urls from the manifest file from one of the textfiles.com IA items. the paths even start with the site name.
[20:41] http://github.com/joepie91/nzbspider
[20:41] done