#warrior 2014-07-07,Mon

Time	Nickname	Message
03:13 ^🔗	yuri_seva	Hey guys, I'm looking for some details on the warrior scraping API, capabilities, etc...
03:13 ^🔗	yuri_seva	The reason I'm asking is because I'm working on something similar, aimed at complex sites like facebook, etc... javascript/dom intensive stuff :)
03:14 ^🔗	yuri_seva	(Based on Qt!)
03:14 ^🔗	yuri_seva	^^ But I figured I'd reach out to the community and see if there's any common ground. I love what you're doing!
03:16 ^🔗	chfoo	yuri_seva: for hyves project, the javascript stuff was all analyzed in the browser and most of the requests were hand crafted with lots of regex and scripting in wget-lua
03:17 ^🔗	chfoo	as far as i know, we haven't used any webkit based solutions
03:18 ^🔗	yuri_seva	Yeah, that's what I'm curious about. Qt >= 5.3 has a prototype project based on chromium. Combine that with the new "offscreen" platform plugin, and suddenly you can actually execute javascript in headless c++
03:19 ^🔗	yuri_seva	I've been testing it with facebook, getting really neat results
03:20 ^🔗	chfoo	yuri_seva: we are definitely curious about handling javascript. archivebot has phantomjs wedged into it but it's far from perfect.
03:22 ^🔗	chfoo	but generally, warrior projects are very specific and there haven't been any serious issues with javascript yet
03:24 ^🔗	chfoo	however the warrior allows running arbitrary binaries so anything that works in debian 6 can be used
03:27 ^🔗	yuri_seva	I'll have to try digging through the scripts in the vm image. Trying to make easier ways of extracting model data from forums, blogs, etc.
03:28 ^🔗	yuri_seva	Here's a tool I wrote... based on some earlier work with Qt
03:28 ^🔗	yuri_seva	https://github.com/yuri-sevatz/libfacebook/blob/master/fbhack/fbhack.cpp
03:28 ^🔗	yuri_seva	^^ They don't let you do that in their TOS ;)
03:28 ^🔗	yuri_seva	The cool thing is, I have no idea what half the code in their site does... but I can just drive it without paying attention
03:29 ^🔗	yuri_seva	and their xml is scary!
03:30 ^🔗	yuri_seva	And this is actually all I had to do for full-blown login/integration
03:30 ^🔗	yuri_seva	https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp
03:30 ^🔗	yuri_seva	Anyway, do you guys need anything like that?
03:31 ^🔗	yuri_seva	Like... tools to make these easier, or more playful
03:34 ^🔗	db48x	yuri_seva: yes
03:35 ^🔗	yuri_seva	Specifically for things like asynchronous DOM loading, high-level javascript execution, CSS2 selectors... things to make our lives more fun :)
03:37 ^🔗	yuri_seva	lol nice, Okay, uhmm... let me spend some time going through your toolkit. I'll try and get a feel for what kind of primitives the framework lets you get and set
03:38 ^🔗	db48x	yuri_seva: the warrior code only deals with the outer process of starting a job, running the steps of that job and running them concurrently with each other
03:39 ^🔗	db48x	the steps of the job are generally things like making a directory to hold the archived data, running wget or a similar tool with the right options to archive something, then rsyncing that somewhere
03:40 ^🔗	yuri_seva	yeah I think it's the wget (and below) part that this entails
03:41 ^🔗	yuri_seva	The actual extraction part
03:42 ^🔗	yuri_seva	i caught wind of some of the modified versions of wget
03:45 ^🔗	chfoo	well, something that is missing from wget is the javascript/dom portion of the web browser stack. if there was a python/ruby library that allowed you to put in html/js and have it spit back out resources requested/dom modifications/cookie changes, that would be really useful.
03:48 ^🔗	db48x	chfoo: interaction is a big part of it though
03:48 ^🔗	yuri_seva	Yeah it's a bit of a neat thing... you can kind of get away with loading html/javascript from a text blob, but it still forces you to create a context (Sort of saying... "where the page came from, what's it relative to, etc")
03:48 ^🔗	yuri_seva	Once you have that, you can access it just like a user, but in code
03:49 ^🔗	db48x	https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp#L97-100
03:49 ^🔗	yuri_seva	Exactly! -- and the javascript execution after, things like "this.click()"
03:50 ^🔗	yuri_seva	That lets you easily take on complex companies like google and facebook
03:50 ^🔗	yuri_seva	Now... I'm trying to find a way to wrap this (either with a library, or CLI) -- to make maintaining web template interactions easier
03:54 ^🔗	yuri_seva	So that when you say "give me this thread", you can get back a formatted set/dictionary of data, sort of specific to a model you define (on a per-site basis)
03:54 ^🔗	*	db48x nods
03:55 ^🔗	db48x	we also like to capture the http requests themselves in a WARC
03:55 ^🔗	db48x	since that lets us reproduce the exact site
03:55 ^🔗	yuri_seva	QNetworkAccessManager -- each browser instance comes with one
03:56 ^🔗	yuri_seva	you can MITM or listen to everything in and out
03:56 ^🔗	yuri_seva	I use similar tricks to find out how busy the AJAX is ;)
03:57 ^🔗	yuri_seva	There's other cool features too... recently I just found out how to turn on/off media loading
03:58 ^🔗	yuri_seva	so I can go "save bandwidth", or "stealth" to reduce the digital fingerprint, lol
04:00 ^🔗	yuri_seva	Oh... and you'll like this
04:00 ^🔗	yuri_seva	I did some javascript injection too!
04:00 ^🔗	yuri_seva	To modify complex pages while they're running, and hook javascript events
04:01 ^🔗	yuri_seva	Then get the results back in c++ ... very cool stuff
04:02 ^🔗	yuri_seva	I'll dig around more, need to get up to speed on what you have/need... and we'll try and get something fun for the toolkit
04:02 ^🔗	yuri_seva	This looks like a good place to start
04:02 ^🔗	yuri_seva	http://www.archiveteam.org/index.php?title=Dev/Source_Code
04:02 ^🔗	yuri_seva	:)
04:04 ^🔗	yuri_seva	Btw, anybody here going to defcon this year?
04:13 ^🔗	yuri_seva	(In Vegas!)
04:16 ^🔗	db48x	probably
04:16 ^🔗	db48x	not I though
04:16 ^🔗	db48x	and yep, that's the right page
04:26 ^🔗	db48x	I think we would probably use a proxy server to create the warc
04:27 ^🔗	db48x	mostly because it already exists, but also because it means it's easier to share across projects
04:28 ^🔗	db48x	on the other hand it means we'd want to be able to configure the proxy your program uses at runtime :)
04:29 ^🔗	db48x	(although lxc containers would eliminate even that...)
04:30 ^🔗	yuri_seva	Hmm, that's a neat idea. I was looking at the callbacks in the wget lua api, I can see why this helps you create the warc file
04:31 ^🔗	yuri_seva	https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks#callback-download_child_p
04:31 ^🔗	yuri_seva	But I suspect the API Qt gives us hands out enough information to see all the requests
04:32 ^🔗	db48x	yes, I'm sure it would
04:34 ^🔗	yuri_seva	It might be useful initially if we get it to be more of a complement to the wget-lua toolkit
04:35 ^🔗	yuri_seva	In my experience a headless browser is well-suited for DOM interaction, still a lot of lifting to do if you want to say, censor network access, or parse certain domains only
04:36 ^🔗	yuri_seva	because that has a lot of implications on the browser stack
04:36 ^🔗	yuri_seva	(Like, simple example, if you censor jquery... that's a big impact!)
04:38 ^🔗	yuri_seva	They sort of made it to be simple, but, with that comes the overhead of driving a lot of ajax and media.
04:39 ^🔗	yuri_seva	Perfect metaphor: a LARGE boat =P
04:41 ^🔗	yuri_seva	Maybe you'd do something like, loading a complex page, then handing the DOM elements of interest to a database, or to wget-lua for the actual saving
04:41 ^🔗	yuri_seva	^^ Say, trying to mirror youtube :)
04:41 ^🔗	Coderjoe	wtf
04:41 ^🔗	Coderjoe	two processors for a single drive?
04:41 ^🔗	Coderjoe	er
04:41 ^🔗	Coderjoe	channel
04:42 ^🔗	yuri_seva	Well, kind of a pipeline, you'd load one into the browser stack to execute javascript if there's say, complex data you need to operate on
04:42 ^🔗	yuri_seva	And then put the results somewhere else
04:43 ^🔗	yuri_seva	(This is useful for a lot of sites that use MVC a lot)
04:44 ^🔗	yuri_seva	The actual saving in the browser stack isn't nearly as detailed... it's just high level so it's an easier ship to steer
04:45 ^🔗	yuri_seva	Your wget will always be more efficient, the stack is just good for more complex data extraction, or more complex UI interactions wherein you only care about learning something before saving
04:46 ^🔗	yuri_seva	^^ I might be wrong about the best way to use such a thing, but that's just from what I've sort of seen
04:48 ^🔗	yuri_seva	Obviously you'll never be able to share the connection between the "two processors", but, that's just because a high-level browser stack is quite a big boat to multitask :)
04:48 ^🔗	yuri_seva	You'd never have it pull the same performance as say, wget
04:51 ^🔗	yuri_seva	Good sites where this is useful...
04:51 ^🔗	yuri_seva	Imgur, photobucket... facebook... anybody that uses nasty amounts of javascript that's subject to change a lot
04:53 ^🔗	yuri_seva	Or, MEGA... because they have a lot of js!
04:54 ^🔗	yuri_seva	Anyway, gotta sleep, I'll think on this. Seems that there are a lot of rule sets to try and merge. Maybe possible one day!
04:56 ^🔗	yuri_seva	Thanks for entertaining the idea!
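
The job steps db48x describes above (make a directory for the archived data, run wget with the right options, then rsync the result somewhere) could be sketched roughly as follows. This is not the warrior's actual pipeline code: the function name, the directory layout, the example URL and the rsync target are all hypothetical, and the commands are only built as argument lists, not executed. The wget and rsync options used are standard ones; a real project would likely need many more.

```python
import shlex
from pathlib import PurePosixPath


def build_job_steps(item_name, url, data_dir, rsync_target):
    """Return the commands for one archive job, in the order they would run.

    Hypothetical helper: names and layout are assumptions for illustration.
    """
    item_dir = PurePosixPath(data_dir) / item_name
    warc_prefix = item_dir / item_name  # wget appends .warc.gz to this prefix
    return [
        # Step 1: a directory to hold the archived data.
        ["mkdir", "-p", str(item_dir)],
        # Step 2: fetch the page plus its requisites, recording the HTTP
        # traffic in a WARC file alongside the downloaded files.
        [
            "wget",
            "--page-requisites",
            f"--directory-prefix={item_dir}",
            f"--warc-file={warc_prefix}",
            url,
        ],
        # Step 3: upload the finished WARC somewhere else.
        ["rsync", "-av", f"{warc_prefix}.warc.gz", rsync_target],
    ]


if __name__ == "__main__":
    steps = build_job_steps(
        "item1",
        "http://example.com/",
        "data",
        "rsync://example.invalid/module/",  # placeholder target
    )
    for step in steps:
        print(shlex.join(step))
```

A headless-browser stage like the one yuri_seva proposes would presumably slot in as an extra step before or alongside wget, under the same outer process.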
