#warrior 2014-07-07,Mon


Time Nickname Message
03:13 🔗 yuri_seva Hey guys, I'm looking for some details on the warrior scraping API, capabilities, etc...
03:13 🔗 yuri_seva The reason I'm asking is because I'm working on something similar, aimed at complex sites like facebook, etc... javascript/dom intensive stuff :)
03:14 🔗 yuri_seva (Based on Qt!)
03:14 🔗 yuri_seva ^^ But I figured I'd reach out to the community and see if there's any common ground. I love what you're doing!
03:16 🔗 chfoo yuri_seva: for hyves project, the javascript stuff was all analyzed in the browser and most of the requests were hand crafted with lots of regex and scripting in wget-lua
03:17 🔗 chfoo as far as i know, we haven't used any webkit based solutions
03:18 🔗 yuri_seva Yeah, that's what I'm curious about. Qt >= 5.3 has a prototype project based on chromium. Combine that with the new "offscreen" platform plugin, and suddenly you can actually execute javascript in headless c++
03:19 🔗 yuri_seva I've been testing it with facebook, getting really neat results
03:20 🔗 chfoo yuri_seva: we are definitely curious about handling javascript. archivebot has phantomjs wedged into it but it's far from perfect.
03:22 🔗 chfoo but generally, warrior projects are very specific and there haven't been any serious issues with javascript yet
03:24 🔗 chfoo however the warrior allows running arbitrary binaries so anything that works in debian 6 can be used
03:27 🔗 yuri_seva I'll have to try digging through the scripts in the vm image. Trying to make easier ways of extracting model data from forums, blogs, etc.
03:28 🔗 yuri_seva Here's a tool I wrote... based on some earlier work with Qt
03:28 🔗 yuri_seva https://github.com/yuri-sevatz/libfacebook/blob/master/fbhack/fbhack.cpp
03:28 🔗 yuri_seva ^^ They don't let you do that in their TOS ;)
03:28 🔗 yuri_seva The cool thing is, I have no idea what half the code in their site does... but I can just drive it without paying attention
03:29 🔗 yuri_seva and their xml is scary!
03:30 🔗 yuri_seva And this is actually all I had to do for full-blown login/integration
03:30 🔗 yuri_seva https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp
03:30 🔗 yuri_seva Anyway, do you guys need anything like that?
03:31 🔗 yuri_seva Like... tools to make these easier, or more playful
03:34 🔗 db48x yuri_seva: yes
03:35 🔗 yuri_seva Specifically for things like asynchronous DOM loading, high-level javascript execution, CSS2 selectors... things to make our lives more fun :)
03:37 🔗 yuri_seva lol nice, Okay, uhmm... let me spend some time going through your toolkit. I'll try and get a feel for what kind of primitives the framework lets you get and set
03:38 🔗 db48x yuri_seva: the warrior code only deals with the outer process of starting a job, running the steps of that job and running them concurrently with each other
03:39 🔗 db48x the steps of the job are generally things like making a directory to hold the archived data, running wget or a similar tool with the right options to archive something, then rsyncing that somewhere
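[Editor's sketch] The three job steps db48x lists (make a directory, run wget, rsync the result) could be laid out roughly as below. This is a minimal illustration, not the warrior's actual pipeline code: the function name `build_job_steps`, the item name, the directory layout, and the rsync target are all hypothetical. Only the commands are built here; nothing is executed.

```python
import shlex
from pathlib import Path


def build_job_steps(item_name, url, data_root, rsync_target):
    """Return the ordered commands for one hypothetical archive job."""
    item_dir = Path(data_root) / item_name
    warc_prefix = item_dir / item_name
    return [
        # Step 1: make a directory to hold the archived data.
        ["mkdir", "-p", str(item_dir)],
        # Step 2: run wget and record the traffic in a WARC file.
        ["wget", "--page-requisites", "--warc-file=" + str(warc_prefix), url],
        # Step 3: rsync the directory contents somewhere else (target is assumed).
        ["rsync", "-av", str(item_dir) + "/", rsync_target],
    ]


if __name__ == "__main__":
    steps = build_job_steps(
        "item-0001",
        "http://example.com/",
        "data",
        "rsync://example.invalid/incoming/",
    )
    for step in steps:
        print(shlex.join(step))
```

Keeping each step as an argument list makes it easy to hand to `subprocess.run` later, or to swap wget for another tool without touching the other steps.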
03:40 🔗 yuri_seva yeah I think it's the wget (and below) part that this part entails
03:41 🔗 yuri_seva The actual extraction part
03:42 🔗 yuri_seva i caught wind of some of the modified versions of wget
03:45 🔗 chfoo well, something that is missing from wget is the javascript/dom portion of the web browser stack. if there was a python/ruby library that allowed you to put in html/js and have it spit back out resources requested/dom modifications/cookie changes, that would be really useful.
03:48 🔗 db48x chfoo: interaction is a big part of it though
03:48 🔗 yuri_seva Yeah it's a bit of a neat thing... you can kind of get away with loading html/javascript from a text blob, but it still forces you to create a context (Sort of saying... "where the page came from, what's it relative to, etc")
03:48 🔗 yuri_seva Once you have that, you can access it just like a user, but in code
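[Editor's sketch] The point about needing a "context" can be shown without a browser: relative resource URLs in an HTML blob only mean something once a base URL is supplied. The sketch below uses Python's standard `html.parser` and `urljoin`; it covers static markup only and does not execute any javascript, so it is a small part of what chfoo describes, not the Qt approach discussed here. The names `ResourceCollector` and `collect_resources` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs referenced by a few common tags, resolved against a base URL."""

    # Which attribute carries the URL for each tag we look at (an assumed, partial list).
    URL_ATTRS = {"script": "src", "img": "src", "link": "href", "a": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # The base URL is the "context": where the page came from.
                self.resources.append(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources


if __name__ == "__main__":
    blob = '<img src="a.png"><script src="/js/app.js"></script>'
    for url in collect_resources(blob, "http://example.com/dir/page.html"):
        print(url)
```

The same blob resolved against a different base URL yields different absolute URLs, which is why the text blob alone is not enough.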
03:49 🔗 db48x https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp#L97-100
03:49 🔗 yuri_seva Exactly! -- and the javascript execution after, things like "this.click()"
03:50 🔗 yuri_seva That lets you easily take on complex companies like google and facebook
03:50 🔗 yuri_seva Now... I'm trying to find a way to wrap this (either with a library, or CLI) -- to make maintaining web template interactions easier
03:54 🔗 yuri_seva So that when you say "give me this thread", you can get back a formatted set/dictionary of data, sort of specific to a model you define (on a per-site basis)
03:54 🔗 * db48x nods
03:55 🔗 db48x we also like to capture the http requests themselves in a WARC
03:55 🔗 db48x since that lets us reproduce the exact site
03:55 🔗 yuri_seva QNetworkAccessManager -- each browser instance comes with one
03:56 🔗 yuri_seva you can MITM or listen to everything in and out
03:56 🔗 yuri_seva I use similar tricks to find out how busy the AJAX is ;)
03:57 🔗 yuri_seva There's other cool features too... recently I just found out how to turn on/off media loading
03:58 🔗 yuri_seva so I can go "save bandwidth", or "stealth" to reduce the digital fingerprint, lol
04:00 🔗 yuri_seva Oh... and you'll like this
04:00 🔗 yuri_seva I did some javascript injection too!
04:00 🔗 yuri_seva To modify complex pages while they're running, and hook javascript events
04:01 🔗 yuri_seva Then get the results back in c++ ... very cool stuff
04:02 🔗 yuri_seva I'll dig around more, need to get up to speed on what you have/need... and we'll try and get something fun for the toolkit
04:02 🔗 yuri_seva This looks like a good place to start 
04:02 🔗 yuri_seva http://www.archiveteam.org/index.php?title=Dev/Source_Code
04:02 🔗 yuri_seva :)
04:04 🔗 yuri_seva Btw, anybody here going to defcon this year?
04:13 🔗 yuri_seva (In Vegas!)
04:16 🔗 db48x probably
04:16 🔗 db48x not I though
04:16 🔗 db48x and yep, that's the right page
04:26 🔗 db48x I think we would probably use a proxy server to create the warc
04:27 🔗 db48x mostly because it already exists, but also because it means it's easier to share across projects
04:28 🔗 db48x on the other hand it means we'd want to be able to configure the proxy your program uses at runtime :)
04:29 🔗 db48x (although lxc containers would eliminate even that...)
04:30 🔗 yuri_seva Hmm, that's a neat idea. I was looking at the callbacks in the wget lua api, I can see why this helps you create the warc file
04:31 🔗 yuri_seva https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks#callback-download_child_p
04:31 🔗 yuri_seva But I suspect the API Qt gives us hands out enough information to see all the requests
04:32 🔗 db48x yes, I'm sure it would
04:34 🔗 yuri_seva It might be useful initially if we get it to be more of a complement to the wget-lua toolkit
04:35 🔗 yuri_seva In my experience a headless browser is well-suited for DOM interaction, still a lot of lifting to do if you want to say, censor network access, or parse certain domains only
04:36 🔗 yuri_seva because that has a lot of implications on the browser stack
04:36 🔗 yuri_seva (Like, simple example, if you censor jquery... that's a big impact!)
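[Editor's sketch] A rough picture of "parse certain domains only": split the requests a page makes into allowed and blocked by hostname. This is a standalone illustration using `urllib.parse`, not Qt's request interception API; the function names and the example URLs are hypothetical. It also shows why the jquery case hurts: a script hosted on another domain lands in the blocked list even though the page depends on it.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def partition_requests(urls, allowed_domains):
    """Split URLs into (allowed, blocked) lists, preserving order."""
    allowed, blocked = [], []
    for url in urls:
        (allowed if is_allowed(url, allowed_domains) else blocked).append(url)
    return allowed, blocked


if __name__ == "__main__":
    requests = [
        "http://example.com/page",
        "http://cdn.example.com/app.js",
        "http://code.jquery.com/jquery.js",
    ]
    allowed, blocked = partition_requests(requests, ["example.com"])
    print("allowed:", allowed)
    print("blocked:", blocked)
```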
04:38 🔗 yuri_seva They sort of made it to be simple, but, with that comes the overhead of driving a lot of ajax and media.
04:39 🔗 yuri_seva Perfect metaphor: a LARGE boat =P
04:41 🔗 yuri_seva Maybe you'd do something like, loading a complex page, then handing the DOM elements of interest to a database, or for wget-lua for the actual saving
04:41 🔗 yuri_seva ^^ Say, trying to mirror youtube :)
04:41 🔗 Coderjoe wtf
04:41 🔗 Coderjoe two processors for a single drive?
04:41 🔗 Coderjoe er
04:41 🔗 Coderjoe channel
04:42 🔗 yuri_seva Well, kind of a pipeline, you'd load one into the browser stack to execute javascript if there's say, complex data you need to operate on
04:42 🔗 yuri_seva And then put the results somewhere else
04:43 🔗 yuri_seva (This is useful for a lot of sites that use MVC a lot)
04:44 🔗 yuri_seva The actual saving in the browser stack isn't nearly as detailed... it's just high level so it's an easier ship to steer
04:45 🔗 yuri_seva Your wget will always be more efficient, the stack is just good for more complex data extraction, or more complex UI interactions wherein you only care about learning something before saving
04:46 🔗 yuri_seva ^^ I might be wrong about the best way to use such a thing, but that's just from what I've sort of seen
04:48 🔗 yuri_seva Obviously you'll never be able to share the connection between the "two processors", but, that's just because a high-level browser stack is quite a big boat to multitask :)
04:48 🔗 yuri_seva You'd never have it pull the same performance as say, wget
04:51 🔗 yuri_seva Good sites where this is useful...
04:51 🔗 yuri_seva Imgur, photobucket... facebook... anybody that uses nasty amounts of javascript that's subject to change a lot
04:53 🔗 yuri_seva Or, MEGA... because they have a lot of js!
04:54 🔗 yuri_seva Anyway, gotta sleep, I'll think on this. Seems that there's a lot of rule sets to try and merge. Maybe possible one day!
04:56 🔗 yuri_seva Thanks for entertaining the idea!
