#archiveteam-bs 2013-06-30,Sun


Time Nickname Message
00:14 🔗 winr4r joepie91: he's not been in since i returned about a week ago
00:14 🔗 godane so i'm archiving techcrunch by month again
00:14 🔗 godane mostly cause my wifi sucks
00:15 🔗 godane it is doing better now at the moment
00:23 🔗 joepie91 :/
03:03 🔗 godane so 2010 episodes of HD Nation may be going up soon
03:28 🔗 SketchCow Good.
04:32 🔗 godane g4tv.com-video43055: Sasha Grey Behind-the-Scenes Interview: https://archive.org/details/g4tv.com-video43055
04:32 🔗 godane there is also a flvhd version of that interview
04:33 🔗 godane g4 didn't have hd broadcast until april 2010 but
04:33 🔗 godane before that there were behind-the-scenes interviews that were in hd
06:32 🔗 godane ping
06:33 🔗 BlueMax dingaling
06:34 🔗 godane ok
06:35 🔗 godane just testing if i need to restart pidgin
08:31 🔗 BlueMax I just had a really, really stupid idea
08:46 🔗 BlueMax Is it possible to get a list of every file that textfiles.com links to? By that I mean a list of URLs for every publicly viewable textfile on the site from the directory.
09:03 🔗 ivan` https://ia700608.us.archive.org/4/items/textfiles-dot-com-2011/MANIFEST.txt
09:04 🔗 BlueMax while that is what I'm looking for ivan` I mainly wanted to focus on the textfiles...although I realise how hard that might be
09:06 🔗 ivan` I don't get it, do you want a subset of that manifest?
09:08 🔗 BlueMax What I'm looking for is, from http://textfiles.com/directory.html, a list of textfiles and/or links to said textfiles viewable from said page
09:08 🔗 BlueMax so every textfile from http://textfiles.com/100/, http://textfiles.com/adventure/ and so on
09:19 🔗 SmileyG ffs
09:19 🔗 SmileyG life takes up so much time sometimes ¬_¬
09:19 🔗 SmileyG pregnant wife + police statements + work == no time for smiley
09:21 🔗 BlueMax you have a pregnant police work?
09:22 🔗 SmileyG hehe
09:22 🔗 SmileyG no, my wife is ~7 weeks pregnant and extremely tired all the time
09:23 🔗 BlueMax poor smiley
09:26 🔗 BlueMax also SmileyG do you know of any way I could get a list like I just described a few lines ago or am I being farfetched :P
09:27 🔗 godane so my wifi is working fine right now
09:27 🔗 SmileyG BlueMax: I'd advise asking SketchCow nicely
09:27 🔗 SmileyG godane: \o/
09:27 🔗 BlueMax very well then, I shall ask SketchCow
09:28 🔗 godane i'm close to getting all public 58xxx-flvhd files uploaded
09:40 🔗 BlueMax dunno if SketchCow would like a PM or if he can just read everything here
10:04 🔗 godane i hope you guys have this ftp: http://mirrors.apple2.org.za/
10:05 🔗 godane it has tons of apple 2 stuff from other ftp sites
10:34 🔗 godane i had to add 2 files here: http://archive.org/details/g4tv.com-video58984-flvhd
10:34 🔗 godane cause both are for the 58984 video key
10:35 🔗 godane the tr_sing file is the public one on the website
10:35 🔗 godane whereas the other one was in the video-clip-by-id xml dump i did
10:43 🔗 Coderjoe BlueMax: wouldn't it largely just be parsing the paths from that manifest?
10:44 🔗 BlueMax Coderjoe, I have no idea how to do that to be honest :P
12:47 🔗 BlueMax hmm
12:53 🔗 BlueMax got my silly idea working, now to actually get the URL list
12:55 🔗 BlueMax which I have no idea how to get. bugger.
12:55 🔗 GLaDOS What was the idea?
12:55 🔗 GLaDOS (server with ZNC on it died for once)
12:56 🔗 BlueMax alright, this is gonna sound really stupid but I've had stupid ideas before and they've turned out well
12:56 🔗 BlueMax I made a silly little set of "microgames" which one of which is selected at random
12:56 🔗 BlueMax well, I only have one, but I plan to add more
12:57 🔗 BlueMax and if you manage to successfully finish it, you get a random textfile from textfiles.com as a reward
12:58 🔗 GLaDOS That sounds rather interesting.
12:58 🔗 GLaDOS Let me guess, the URL list would be a list of text files?
12:58 🔗 BlueMax I call it "textfiles the videogame".
12:58 🔗 BlueMax Yes GLaDOS, just a big list of all the readable textfiles you can get to from the "directory"
12:58 🔗 BlueMax and the game just picks one at random
12:58 🔗 BlueMax and then it opens it in a new tab and resets the game so you can do it again if you wish.
12:59 🔗 BlueMax (At least, that's the plan)
13:00 🔗 BlueMax My mind jumped back to SketchCow saying "gamify it" on the subject of the tracker on the Warrior
13:00 🔗 BlueMax and I thought "I wonder if I could do something with that and Textfiles.com"
13:00 🔗 BlueMax and this is what I came up with
13:01 🔗 BlueMax So yeah, I want a list of URLs of the readable textfiles on textfiles.com to make this happen, can probably do the rest myself.
13:01 🔗 GLaDOS Well, all you'd need to do to be able to compile a list of files would be to scrape the directory for any URLs that end in .txt
13:02 🔗 GLaDOS HAHA DISREGARD THAT I SUCK COCKS
13:02 🔗 GLaDOS Scrape for any file that doesn't start with <HTML>
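A minimal sketch of GLaDOS's check, using only Python 2's standard library: fetch the first few hundred bytes of a candidate URL and keep it only if the body does not start with an HTML tag. The 512-byte read is an arbitrary choice, and this is a heuristic, not anything textfiles.com guarantees.

    import urllib2

    def looks_like_textfile(url):
        # textfiles.com serves plain textfiles as-is, so anything whose body
        # starts with an HTML tag is probably an index page, not a textfile.
        head = urllib2.urlopen(url).read(512).lstrip().lower()
        return not head.startswith("<html")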
13:03 🔗 BlueMax I was about to say something along the lines of that not working
13:03 🔗 BlueMax But I'm not sure how to scrape for something like that
13:03 🔗 BlueMax I'm a lowly Windows peasant
13:03 🔗 GLaDOS Ah, right.
13:04 🔗 ivan` cygwin
13:04 🔗 antomatic i could write a batch file to do it, but that'd only work if all the files were local. so that's not really very helpful at all.. erm.. I'll shut up now.
13:04 🔗 BlueMax lol antomatic
13:04 🔗 BlueMax ivan`, I have heard SketchCow personally slam cygwin for being...in fact I don't remember what he said but nevertheless!
13:05 🔗 antomatic I suppose I could head up the batchfile with 'wget textfiles.com' to make all the files local. :)
13:05 🔗 ivan` BlueMax: I use it every day, it works for most things I need
13:06 🔗 BlueMax antomatic, I was planning on hosting this on my own server (as this is an HTML5 game) so I wanna try and avoid needing to upload too much data
13:06 🔗 BlueMax ivan`, lol, to be perfectly honest I don't even know what it is
13:06 🔗 ivan` it gives you what you need, grep
13:06 🔗 ivan` and a lot of other utilities
13:08 🔗 BlueMax so it's a Linux command line replacement?
13:08 🔗 ivan` yes
13:09 🔗 BlueMax I see
13:10 🔗 antomatic there's supposed to be a way to get a random result from google.. don't know if you could get it to feed one from 'site:textfiles.com filetype:txt' or similar
13:11 🔗 antomatic Hm, that only brings up specific .txt files
13:11 🔗 antomatic bleh. disregard.
13:14 🔗 BlueMax <antomatic> HAHA DISREGARD THAT I SUCK COCKS
13:15 🔗 antomatic Not sure I have time, I was too busy disregarding GlaDOS sucking cocks. :)
13:15 🔗 antomatic jk
13:15 🔗 BlueMax Hmm
13:17 🔗 BlueMax GLaDOS, any ideas?
13:17 🔗 GLaDOS None so far.
13:19 🔗 BlueMax I mean for getting a URL list
13:19 🔗 GLaDOS Oh
13:19 🔗 GLaDOS I still think that crawling would be the best way.
13:20 🔗 BlueMax I don't really know how. :P How would I go about doing that
13:21 🔗 GLaDOS Don't ask me, I suck at it!
13:21 🔗 BlueMax You suck a lot of things apparently. :P
13:21 🔗 GLaDOS hue
13:22 🔗 antomatic would it be as simple as just grabbing the major indexes from http://web.textfiles.com and jamming them all together?
13:23 🔗 BlueMax What do you mean, antomatic
13:25 🔗 antomatic I mean, if you want a list of all text files on web.textfiles.com, theoretically that would just be every file listed on the pages at web.textfiles.com/computers/ , web.textfiles.com/hacking/ , web.textfiles.com/humor/ , etc
13:25 🔗 antomatic web.textfiles.com/filestats.html - 7418 files in total
13:26 🔗 BlueMax yeah, that's pretty much it, except just as full URLs to all of them.
13:28 🔗 antomatic seems doable
13:28 🔗 BlueMax don't forget textfiles.com in general :P
13:29 🔗 BlueMax not just web.textfiles.com
13:29 🔗 antomatic ah.
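A rough sketch of antomatic's approach: fetch the directory page, follow each section index, and emit absolute URLs for everything that doesn't look like another index page. It assumes the directory page links to section indexes ending in "/" (like http://textfiles.com/100/); pointing it at web.textfiles.com instead is just a matter of changing the starting URL. HTMLParser is crude, but these pages are simple.

    import urllib2, urlparse
    from HTMLParser import HTMLParser

    class LinkGrabber(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def links_on(url):
        grabber = LinkGrabber()
        grabber.feed(urllib2.urlopen(url).read())
        return [urlparse.urljoin(url, href) for href in grabber.links]

    def textfile_urls(directory_url="http://textfiles.com/directory.html"):
        urls = []
        for section in links_on(directory_url):
            # Only follow links that look like section indexes, e.g. /100/
            if not section.endswith("/"):
                continue
            for link in links_on(section):
                if not link.lower().endswith((".html", ".htm", "/")):
                    urls.append(link)
        return urls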
13:34 🔗 BlueMax you can see a silly little test of the "game" at http://bluemaxima.org/thegame
13:35 🔗 BlueMax it's set to open one textfile by default and it only has one minigame
13:36 🔗 BlueMax but you get the general idea
13:45 🔗 BlueMax should I wait to get SketchCow's opinion on this?
13:46 🔗 * SmileyG looks
13:46 🔗 SmileyG thats errrm, random?
13:46 🔗 SmileyG type a swear word to win?
13:48 🔗 BlueMax I thought it fit the theme
13:48 🔗 BlueMax There will be a few different games
13:48 🔗 BlueMax Just made that one to make an easy test
13:49 🔗 SmileyG k
13:49 🔗 BlueMax so yeah
13:50 🔗 BlueMax I'd like to get the collection of textfiles in there first before I continue
14:06 🔗 joepie91 that looks like game maker
14:06 🔗 joepie91 :P
14:11 🔗 BlueMax joepie91, that IS game maker
14:11 🔗 BlueMax :P
14:12 🔗 godane i've uploaded up to episode 38 of HD Nation
14:12 🔗 joepie91 heh
14:13 🔗 BlueMax it works, so why not right :P
14:16 🔗 winr4r morning/afternoon
14:19 🔗 winr4r BlueMax: if you need to download textfiles.com you don't need to wget -r it
14:19 🔗 BlueMax winr4r, I don't want to download it, I just want a list of URLs
14:19 🔗 winr4r BlueMax: oh, my bad
14:19 🔗 winr4r well if the backups of it on archive.org have the same directory structure
14:20 🔗 winr4r then find . | sed s/'^\.'/'http:\/\/www.textfiles.com'/
14:21 🔗 BlueMax I'm not on Linux
14:21 🔗 winr4r oh
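For anyone on Windows, a rough cross-platform equivalent of winr4r's find | sed one-liner, assuming a local mirror whose directory layout matches the site; the mirror path in the comment is a placeholder.

    import os

    def mirror_to_urls(root, prefix="http://www.textfiles.com"):
        # Walk a local mirror and rewrite each relative path into a site URL.
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                yield prefix + "/" + rel.replace(os.sep, "/")

    # for url in mirror_to_urls(r"C:\mirror\textfiles.com"): print url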
14:37 🔗 winr4r also!
14:37 🔗 winr4r is there a reason that the links on the front page on the archive team wiki are absolute URL links
14:37 🔗 winr4r under "Archive Team News"
14:38 🔗 winr4r rather than [[links]] to the pages
14:40 🔗 winr4r (reason being: if you go to archiveteam.org, log in, then go to some of the links on the front page, they point to www.archiveteam.org and so you won't be logged in anymore)
14:47 🔗 SmileyG winr4r: because when they were setup they weren't done properly :D
14:54 🔗 winr4r SmileyG: it could be for a reason though, like for machine-readability purposes
14:54 🔗 winr4r which is why i'm not suggesting that anyone fix it!
15:18 🔗 joepie91 it's so cute when webdevs try to obfuscate/encode their output
15:18 🔗 joepie91 to prevent scrapers
15:18 🔗 joepie91 or even corrupt it
15:18 🔗 joepie91 just makes it more fun for me to try and break it :D
15:18 🔗 joepie91 (see also my last tweet)
15:20 🔗 winr4r you may be assuming they're obfuscating it to make life difficult
15:21 🔗 winr4r rather than to make it smaller
15:21 🔗 joepie91 they're certainly obfuscating to make life difficult
15:21 🔗 joepie91 I'm talking things like javascript packing
15:22 🔗 joepie91 (a bunch of streaming sites do this to hide the video URL)
15:22 🔗 winr4r oh, gotcha
15:22 🔗 joepie91 or omitting </tr> and </td> closing tags
15:22 🔗 joepie91 (binsearch.info does this)
15:22 🔗 joepie91 they seem to actually expect that to stop scrapers lol
15:22 🔗 joepie91 result: I wrote a Javascript unpacker in Python that can deal with any standard packed javascript blob
15:23 🔗 joepie91 and just adjusted my regexes for binsearch
15:23 🔗 joepie91 (not using an XML parser because slow)
15:23 🔗 winr4r excellent
15:23 🔗 joepie91 actually
15:23 🔗 joepie91 in case anyone ever needs to deobfuscate javascript
15:24 🔗 joepie91 https://github.com/joepie91/resolv/blob/develop/resolv/shared.py#L108
15:24 🔗 joepie91 you're welcome :)
15:25 🔗 winr4r you're awesome!
15:26 🔗 joepie91 sidenote: doesn't actually use JS
15:26 🔗 joepie91 so you don't need a JS interpreter
15:26 🔗 joepie91 I just reverse engineered the decoding thing
15:26 🔗 joepie91 (not too hard with JS)
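joepie91's actual unpacker is at the link above; purely as an illustration of the general technique (not his code), the common eval(function(p,a,c,k,e,d){...}) packer just swaps base-N tokens in a payload for words from a list, so it can be reversed without a JS interpreter. The sketch below assumes that standard wrapper and a radix of 36 or less.

    import re

    def _to_base(n, radix, digits="0123456789abcdefghijklmnopqrstuvwxyz"):
        # Mimics JavaScript's Number.toString(radix) for radix <= 36.
        out = ""
        while True:
            n, rem = divmod(n, radix)
            out = digits[rem] + out
            if n == 0:
                return out

    def unpack_packed_js(packed):
        # Pull payload, radix, word count and word list out of the usual
        # ...}('payload',radix,count,'a|b|c'.split('|'),...) tail of the wrapper.
        m = re.search(r"\}\('(.*)',\s*(\d+),\s*(\d+),\s*'(.*?)'\.split\('\|'\)",
                      packed, re.S)
        if not m:
            raise ValueError("not a packed blob this sketch understands")
        payload = m.group(1)
        radix, count = int(m.group(2)), int(m.group(3))
        words = m.group(4).split("|")
        if radix > 36:
            raise ValueError("this sketch only handles radix <= 36")
        for index in range(count - 1, -1, -1):
            word = words[index] if index < len(words) else ""
            if word:
                token = _to_base(index, radix)
                # Word-boundary replace of each token with its original word.
                payload = re.sub(r"\b%s\b" % re.escape(token),
                                 lambda m, w=word: w, payload)
        return payload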
15:27 🔗 joepie91 yup, back up
15:27 🔗 joepie91 VPS not yet, but host node is
15:27 🔗 joepie91 whoops
15:27 🔗 joepie91 wrong channel
15:36 🔗 winr4r joepie91: parsing javascript with regexes?
15:48 🔗 omf_ joepie91, whenever someone says xml parsers are too slow, I know the person saying it is an idiot, you are now an idiot
15:48 🔗 joepie91 winr4r: not exactly, just extracting the relevant bits
15:48 🔗 joepie91 omf_: wat
15:50 🔗 omf_ joepie91> (not using an XML parser because slow)
15:50 🔗 joepie91 yes, what about it?
15:50 🔗 joepie91 if you're extracting a few tiny bits of info from a relatively complex page, then regular expressions are certainly faster than an XML parser
15:51 🔗 joepie91 lol cat caught a mouse
15:52 🔗 omf_ using regex to parse html or xml in any instance is plain dumb because, as veteran programmers know, you cannot assume the result is going to be well formed or won't change, and regex are brittle when dealing with those issues. I used to do it like that 10 years ago before I learned there is a better way.
15:52 🔗 joepie91 omf_, I am well aware of these issues
15:52 🔗 joepie91 what you are forgetting however
15:52 🔗 joepie91 is to take into account the context
15:52 🔗 winr4r omf_: "you cannot assume the result is going to be well formed"
15:52 🔗 winr4r yes, good luck getting a fucking XML parser to read some random HTML page then
15:53 🔗 joepie91 considering the kind of sites I'm scraping, the general page format is unlikely to change due to any reason that is not 'intentionally try to break the scraper'
15:53 🔗 joepie91 in which case, especially considering one of the sites INTENTIONALLY corrupts the HTML to accomplish this
15:53 🔗 joepie91 this is faster/easier to work around when using regexes
15:53 🔗 joepie91 than when using a (potentially VERY slow) parser
15:54 🔗 winr4r i'd use beautiful soup anytime
15:54 🔗 joepie91 I am well aware of the reasons for using an XML parser, but those reasons simply do not apply in this particular case
15:54 🔗 omf_ winr4r, not a problem, good xml parsers are designed to handle broken shit, this was a solved problem years ago
15:54 🔗 winr4r but then i've found pages that break when you parse them with beautiful soup
15:54 🔗 joepie91 winr4r: beautifulsoup is awesome, but incredibly slow compared to XML parsers that assume proper markup
15:54 🔗 winr4r fortunately, i fixed that with a regex!
15:55 🔗 joepie91 and really
15:55 🔗 joepie91 <omf_>joepie91, whenever someone says xml parsers are too slow, I know the person saying it is an idiot, you are now an idiot
15:55 🔗 joepie91 you probably shouldn't be making such contextless assumptions
15:56 🔗 winr4r i'm somewhat in agreement
15:57 🔗 omf_ dude I have read your code plenty of times, I have never been impressed
15:58 🔗 winr4r using regexes to parse HTML is a bad idea, if you can count on the HTML being consistent, well-structured, and not actually designed to break scrapers
15:58 🔗 omf_ regex are the most abused feature in modern programming because everyone thinks they are a solution
15:58 🔗 joepie91 (bonus points: "xml parsers are too slow" in itself doesn't imply not using them or using regexes instead - I'm not quite sure how your reasoning for considering someone an idiot when saying that, in any way correlates with the statement you're judging on)
15:59 🔗 joepie91 omf_: it's not exactly my goal to make you impressed with my code
15:59 🔗 balrog I have 135gb of google video stuff from 2011, can I ul it somewhere?
15:59 🔗 joepie91 my goal is to write software that is as reliable as feasible
15:59 🔗 joepie91 and usable
15:59 🔗 joepie91 in this particular case, that means using regex
16:00 🔗 joepie91 balrog: mm, upload in what sense?
16:00 🔗 joepie91 as in, eventually throw it on IA, or otherwise?
16:00 🔗 omf_ that is your opinion that regex is the best solution
16:00 🔗 balrog yes
16:00 🔗 balrog also, have any of you actually tested how fast parsing html with a real parser is?
16:00 🔗 balrog using regexes will break easily when something changes
16:00 🔗 joepie91 I've benchmarked using lxml vs using beautifulsoup vs using regex on similar kinds of data extraction before
16:00 🔗 winr4r balrog: ahem, so will an XML parser
16:01 🔗 joepie91 balrog: that only applies when the site is likely to change with non-malicious intentions
16:01 🔗 balrog bs4 supports several parsers
16:01 🔗 winr4r sure it'll give you a document tree, but *that document tree has no meaning*
16:01 🔗 joepie91 that is not the case here
16:01 🔗 joepie91 yes, I mean the internal thing it always used
16:01 🔗 joepie91 point is, _if_ the site changes here
16:01 🔗 joepie91 it's with the purpose of intentionally breaking scrapers
16:02 🔗 joepie91 so there's really no point to using an XML parser over a regex, because they'll just break it in a different way
16:02 🔗 balrog might be easier to fix with a parser
16:02 🔗 balrog bs4 supports four parsers
16:02 🔗 balrog some are slow and some are fast
16:02 🔗 joepie91 as for google video btw, you can throw it on the storage box
16:02 🔗 joepie91 yes, I know
16:02 🔗 joepie91 those that can handle broken HTML are slow
16:02 🔗 joepie91 I need to deal with broken HTML
16:03 🔗 balrog yes, because broken html is a pain
16:03 🔗 winr4r balrog: not if you use a regex! ;)
16:03 🔗 omf_ joepie91, you make a lot of assumptions. "so there's really no point to using an XML parser over a regex, because they'll just break it in a different way"
16:03 🔗 balrog supposedly html.parser has gotten better at broken html since python 2.7.3 / 3.2
16:03 🔗 joepie91 balrog: I need to assume 2.6
16:03 🔗 balrog bleh.
16:03 🔗 joepie91 omf_, have you actually looked at what I'm scraping?
16:04 🔗 omf_ yes
16:04 🔗 joepie91 have you actually looked at what the page source looks like?
16:04 🔗 joepie91 and are you actually aware of their history of trying to break scrapers?
16:04 🔗 joepie91 because I have, and I am
16:04 🔗 joepie91 and that is what I have based my design decisions on
16:07 🔗 joepie91 look, if you want to complain about someone that uses regex as hammer where everything is a nail, then go ahead
16:07 🔗 joepie91 but then you shouldn't be directing it to me
16:08 🔗 joepie91 I do actually use HTML/XML parsers when it is feasible and sensible to do so
16:08 🔗 joepie91 (or other format-specific parsers)
16:08 🔗 winr4r i'm with joepie91 here
16:08 🔗 joepie91 but the "regex is always the wrong solution for HTML" attitude is just as nonsensical as "regex is always the right solution for HTML" attitude
16:09 🔗 joepie91 just on the other end of the scale
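joepie91's benchmark itself isn't shown, but a comparison along these lines is easy to set up; "sample.html" is a placeholder for whatever saved page you want to test against, lxml and bs4 have to be installed, and the results depend entirely on the page and the pattern being extracted.

    import re, timeit
    from lxml import html
    from bs4 import BeautifulSoup

    page = open("sample.html").read()   # placeholder: any saved page to test against

    def with_regex():
        return re.findall(r'href="([^"]+)"', page)

    def with_lxml():
        return html.fromstring(page).xpath("//a/@href")

    def with_bs4():
        return [a.get("href") for a in BeautifulSoup(page, "html.parser").find_all("a")]

    for fn in (with_regex, with_lxml, with_bs4):
        print fn.__name__, timeit.timeit(fn, number=100)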
16:11 🔗 omf_ you missed the point I made earlier and chose not to address it. Check your blindspot
16:11 🔗 joepie91 I'm not aware of any missed points.
16:11 🔗 joepie91 could you quote it?
16:12 🔗 omf_ no, use the scrollback, actually read what was said
16:12 🔗 joepie91 ...
16:12 🔗 joepie91 omf_
16:12 🔗 joepie91 I have read every single line you said
16:12 🔗 joepie91 as I always do
16:13 🔗 joepie91 if I haven't responded to one, then I'm going to miss it again if I read back
16:13 🔗 joepie91 because I am under the impression that I did
16:13 🔗 joepie91 hence asking you to quote it
16:17 🔗 SketchCow So, I uploaded about 300 "Architecture Books" from a collection.
16:17 🔗 SketchCow Now, as we all know, the curatorial styles and abilities of some of these online curators can be... variant.
16:17 🔗 SketchCow This one stands alone.
16:19 🔗 SketchCow Their definition of "Architecture" included building design, building MATERIALS, house and home magazines, computer programming design, chip and systems design, and brochures for homes.
16:19 🔗 joepie91 that sounds automated, SketchCow
16:20 🔗 joepie91 especially considering the programming and chip/systems design
16:20 🔗 joepie91 as if someone just grepped their library for everything with 'architecture' in the name
16:23 🔗 SmileyG materials can be used to make design decisions
16:23 🔗 SmileyG 1. strength
16:23 🔗 SmileyG 2. looks
16:23 🔗 SmileyG granite or plastic worktops.
16:24 🔗 winr4r SketchCow: splendid
16:25 🔗 SmileyG So I can understand why its there (though worktops are a terrible example, that'd come under interior design. How about copper used as roofing material for the fact it changes colour over time?)
16:26 🔗 omf_ SmileyG, I got a lot of copper roof homes near where I live
16:27 🔗 SmileyG i'm currently covered in fence protecting solution, I should shower
16:27 🔗 SmileyG but first... A PHOTO!
16:29 🔗 omf_ yes
16:29 🔗 omf_ is it that new neverwet
16:31 🔗 winr4r hm
16:31 🔗 winr4r i've only once seen a copper-roofed building
16:31 🔗 winr4r newbury park tube station in london, it's rather splendid
16:38 🔗 winr4r joepie91: "MySQLdb does _not_ actually do parameterized queries, it just pretends to!" <- wat?
16:39 🔗 joepie91 heh
16:39 🔗 joepie91 yes
16:39 🔗 joepie91 dig into the code
16:39 🔗 joepie91 you'll notice that it actually just takes the values
16:39 🔗 joepie91 escapes them
16:39 🔗 joepie91 and string-concats them
16:39 🔗 joepie91 it's stupid
16:39 🔗 norbert80 joepie91: Any experience with SSL certificates?
16:39 🔗 norbert79 joepie91: btw hi :)
16:39 🔗 joepie91 it just does the parameterized thing because that's what DB-API says it should do :P
16:39 🔗 joepie91 ohai norbert79
16:39 🔗 joepie91 not really, I barely use SSL
16:40 🔗 norbert79 damn... I need to have a proper SSL certificate for my host, Godaddy.com offers one for $6 a year, but I wonder if anyone has technical experience with them
16:40 🔗 joepie91 anyway, winr4r, if you want to use MySQL in Python, just install oursql
16:40 🔗 joepie91 pip install oursql (you'll need compile tools and libmysql-devel)
16:40 🔗 norbert79 because that price smells... I mean it's too cheap to be proper
16:40 🔗 joepie91 it uses a proper driver
16:41 🔗 joepie91 and has all the fancy new features including param queries
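A small example of the difference joepie91 is describing, assuming oursql is installed; the connection details and the users table are placeholders. The ? placeholders are sent to the server as real statement parameters, rather than being escaped and string-concatenated client-side the way MySQLdb does it.

    import oursql

    # Placeholder connection details; replace with a real host/user/database.
    conn = oursql.connect(host="localhost", user="me", passwd="secret", db="test")
    cursor = conn.cursor()
    # qmark-style placeholders, handled as true parameterized statements.
    cursor.execute("SELECT id, name FROM users WHERE name = ?", ("joepie91",))
    print cursor.fetchall()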
16:41 🔗 joepie91 norbert79: stay away from godaddy
16:41 🔗 joepie91 regardless of what they offer
16:41 🔗 winr4r joepie91: oh, thanks :)
16:41 🔗 winr4r yeah, fuck godaddy
16:41 🔗 joepie91 aside from that, I can't see how $6 is "too cheap to be proper"
16:41 🔗 joepie91 it's really not much more than a promise
16:41 🔗 joepie91 there is no real per-unit cost for an SSL cert
16:41 🔗 norbert79 winr4r: No, it sounds good, but I wonder if they offer the signatures from root, or if they use sub-authorities
16:42 🔗 norbert79 joepie91: Actually I just wish to cover my https pages with one valid and useful cert
16:42 🔗 norbert79 joepie91: Right now I gave FreeSSL a try, looks ok so far, but they use subauthorities
16:42 🔗 joepie91 startcom is the only usable free SSL provider that I am aware
16:42 🔗 norbert79 joepie91: Sent a PM
16:42 🔗 joepie91 of
16:43 🔗 norbert79 freeSSL is the demo of RapidSSL
16:43 🔗 norbert79 for a month
16:43 🔗 norbert79 $6 sounds good, but I think there is a catch
16:43 🔗 norbert79 It's just a feeling
16:44 🔗 winr4r it's godaddy, so yes, there'll probably be a catch buried somewhere
16:45 🔗 omf_ thanks for mentioning godady winr4r, it just reminded me I have to cancel that auction shit they signed me up for automatically
16:45 🔗 omf_ that was the catch for me last time
16:46 🔗 norbert79 heh
16:47 🔗 omf_ I am now all namecheap
16:48 🔗 norbert79 ok, so a domain name and SSL cert costs 10.42 EUR, ooor...
16:49 🔗 norbert79 $13.5
16:55 🔗 norbert79 hmm, looks like this is the catch: http://i.imgur.com/ZupVnbU.jpg
16:55 🔗 norbert79 it's not trusted by a main organization
16:55 🔗 norbert79 but some sub-organization
16:56 🔗 norbert79 If I understand this well
16:59 🔗 joepie91 things that requests should implement:
17:00 🔗 joepie91 1. file downloading in chunks with one function
17:00 🔗 joepie91 2. binding a request to an IP
17:00 🔗 norbert79 ?
17:01 🔗 joepie91 python-requests
17:01 🔗 joepie91 the library
17:01 🔗 joepie91 and binding to an IP as in, you have multiple IPs
17:01 🔗 norbert79 Ah, ok, sorry, not related to my issue then
17:01 🔗 joepie91 and you want a request to originate from a specific one
17:01 🔗 joepie91 not related
17:01 🔗 joepie91 just remarking
17:01 🔗 joepie91 (currently monkeypatching support for this into requests...)
17:05 🔗 joepie91 yay, I think I got it working
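For reference, the first wish is covered by requests' stream=True plus iter_content, and one common way to get the second is a custom transport adapter that hands a source_address down to urllib3. This is not joepie91's monkeypatch, just a sketch that assumes a requests/urllib3 recent enough for init_poolmanager to forward extra keyword arguments; the IP, URL and filename are placeholders.

    import requests
    from requests.adapters import HTTPAdapter

    class SourceAddressAdapter(HTTPAdapter):
        # Binds outgoing connections to a specific local IP by passing
        # source_address down to urllib3's pool manager.
        def __init__(self, source_address, **kwargs):
            self.source_address = source_address
            super(SourceAddressAdapter, self).__init__(**kwargs)
        def init_poolmanager(self, *args, **kwargs):
            kwargs["source_address"] = (self.source_address, 0)
            return super(SourceAddressAdapter, self).init_poolmanager(*args, **kwargs)

    session = requests.Session()
    adapter = SourceAddressAdapter("192.0.2.10")   # placeholder local IP
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    # Chunked download with the stock API: stream the body instead of loading it at once.
    response = session.get("http://example.com/big.file", stream=True)
    with open("big.file", "wb") as out:
        for chunk in response.iter_content(8192):
            if chunk:
                out.write(chunk)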
18:06 🔗 joepie91 awesome.
18:06 🔗 joepie91 I think they've blocked my scraper already
18:06 🔗 winr4r :<
18:06 🔗 joepie91 _somehow_
18:06 🔗 joepie91 mysteriously getting HTTP 400s even on previously working NZBs
18:06 🔗 joepie91 works fine via browser
18:07 🔗 winr4r joepie91: user agent?
18:10 🔗 joepie91 wait, maybe not
18:10 🔗 joepie91 it seems all POST requests fail
18:10 🔗 joepie91 winr4r: nah, it randomly selects a legitimate useragent from a list
18:10 🔗 joepie91 changing it didn't fix it
18:10 🔗 joepie91 but one sec
18:11 🔗 joepie91 okay, wow
18:11 🔗 joepie91 I'm such an incredible idiot
18:11 🔗 joepie91 lol
18:11 🔗 joepie91 I was doing a get request in my post request wrapper
18:11 🔗 winr4r excellent
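The user-agent rotation joepie91 mentions a few lines up is simple to sketch; the agent strings below are just examples, not his list.

    import random
    import requests

    # Example browser user-agent strings; any list of legitimate ones will do.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    ]

    def fetch(url):
        # Pick a random legitimate user agent per request.
        return requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})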
20:35 🔗 Coderjoe on the subject of BlueMax's game project: you can get a full list of urls from the manifest file from one of the textfiles.com IA items. the paths even start with the site name.
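A sketch of what Coderjoe describes, using the MANIFEST.txt ivan` linked earlier. The exact manifest line format is an assumption here (one path per line, possibly after other fields, with the path starting with the site name as Coderjoe says), so treat it as a starting point.

    import urllib2

    MANIFEST = "https://ia700608.us.archive.org/4/items/textfiles-dot-com-2011/MANIFEST.txt"

    def urls_from_manifest(manifest_url=MANIFEST):
        urls = []
        for line in urllib2.urlopen(manifest_url):
            fields = line.split()
            if not fields:
                continue
            # Assumed format: the path is the last whitespace-separated field.
            path = fields[-1]
            if path.startswith(("textfiles.com/", "web.textfiles.com/")) \
                    and not path.lower().endswith((".html", ".htm")):
                urls.append("http://" + path)
        return urls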
20:41 🔗 joepie91 http://github.com/joepie91/nzbspider
20:41 🔗 joepie91 done
