Time |
Nickname |
Message |
03:13
🔗
|
yuri_seva |
Hey guys, I'm looking for some details on the warrior scraping API, capabilities, etc... |
03:13
🔗
|
yuri_seva |
The reason I'm asking is because I'm working on something similar, aimed at complex sites like facebook, etc... javascript/dom intensive stuff :) |
03:14
🔗
|
yuri_seva |
(Based on Qt!) |
03:14
🔗
|
yuri_seva |
^^ But I figured I'd reach out to the community and see if there's any common ground. I love what you're doing! |
03:16
🔗
|
chfoo |
yuri_seva: for hyves project, the javascript stuff was all analyzed in the browser and most of the requests were hand crafted with lots of regex and scripting in wget-lua |
03:17
🔗
|
chfoo |
as far as i know, we haven't used any webkit based solutions |
03:18
🔗
|
yuri_seva |
Yeah, that's what I'm curious about. Qt >= 5.3 has a prototype project based on chromium. Combine that with the new "offscreen" platform plugin, and suddenly you can actually execute javascript in headless c++ |
03:19
🔗
|
yuri_seva |
I've been testing it with facebook, getting really neat results |
03:20
🔗
|
chfoo |
yuri_seva: we are definitely curious about handling javascript. archivebot has phantomjs wedged into it but it's far from perfect. |
03:22
🔗
|
chfoo |
but generally, warrior projects are very specific and there haven't been any serious issues with javascript yet |
03:24
🔗
|
chfoo |
however the warrior allows running arbitrary binaries so anything that works in debian 6 can be used |
03:27
🔗
|
yuri_seva |
I'll have to try digging through the scripts in the vm image. Trying to make easier ways of extracting model data from forums, blogs, etc. |
03:28
🔗
|
yuri_seva |
Here's a tool I wrote... based on some earlier work with Qt |
03:28
🔗
|
yuri_seva |
https://github.com/yuri-sevatz/libfacebook/blob/master/fbhack/fbhack.cpp |
03:28
🔗
|
yuri_seva |
^^ They don't let you do that in their TOS ;) |
03:28
🔗
|
yuri_seva |
The cool thing is, I have no idea what half the code in their site does... but I can just drive it without paying attention |
03:29
🔗
|
yuri_seva |
and their xml is scary! |
03:30
🔗
|
yuri_seva |
And this is actually all I had to do for full-blown login/integration |
03:30
🔗
|
yuri_seva |
https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp |
03:30
🔗
|
yuri_seva |
Anyway, do you guys need anything like that? |
03:31
🔗
|
yuri_seva |
Like... tools to make these easier, or more playful |
03:34
🔗
|
db48x |
yuri_seva: yes |
03:35
🔗
|
yuri_seva |
Specifically for things like asynchronous DOM loading, high-level javascript execution, CSS2 selectors... things to make our lives more fun :) |
03:37
🔗
|
yuri_seva |
lol nice, Okay, uhmm... let me spend some time going through your toolkit. I'll try and get a feel for what kind of primitives the framework lets you get and set |
03:38
🔗
|
db48x |
yuri_seva: the warrior code only deals with the outer process of starting a job, running the steps of that job and running them concurrently with each other |
03:39
🔗
|
db48x |
the steps of the job are generally things like making a directory to hold the archived data, running wget or a similar tool with the right options to archive something, then rsyncing that somewhere |
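A minimal sketch of the three steps db48x lists (make a directory, run wget with archiving options, rsync the result), written in Python. The function names, the item name, and the rsync target are hypothetical and are not taken from the warrior code; the wget and rsync flags shown are standard ones, but the exact options a real project uses are an assumption.

```python
import shlex
import subprocess
from pathlib import Path


def build_job_commands(item_name, url, data_root, rsync_target):
    """Return (item_dir, commands) for one hypothetical archiving job."""
    item_dir = Path(data_root) / item_name
    wget_cmd = [
        "wget",
        "--recursive",
        "--page-requisites",
        "--warc-file", str(item_dir / item_name),  # wget adds the .warc.gz suffix
        "--directory-prefix", str(item_dir),
        url,
    ]
    # Trailing slash: copy the contents of the item directory to the target.
    rsync_cmd = ["rsync", "-av", str(item_dir) + "/", rsync_target]
    return item_dir, [wget_cmd, rsync_cmd]


def run_job(item_name, url, data_root, rsync_target):
    """Step 1: make the directory. Steps 2 and 3: run wget, then rsync."""
    item_dir, commands = build_job_commands(item_name, url, data_root, rsync_target)
    item_dir.mkdir(parents=True, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Only print the commands here; run_job() would actually execute them.
item_dir, commands = build_job_commands(
    "example-item", "http://example.com/", "data", "rsync://archive.example/incoming/"
)
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping command construction separate from execution makes each step easy to inspect or swap, for example replacing wget with another tool while leaving the directory and rsync steps alone.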
03:40
🔗
|
yuri_seva |
yeah I think it's the wget (and below) part that this would cover |
03:41
🔗
|
yuri_seva |
The actual extraction part |
03:42
🔗
|
yuri_seva |
i caught wind of some of the modified versions of wget |
03:45
🔗
|
chfoo |
well, something that is missing from wget is the javascript/dom portion of the web browser stack. if there was a python/ruby library that allowed you to put in html/js and have it spit back out resources requested/dom modifications/cookie changes, that would be really useful. |
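A rough sketch of the interface shape chfoo describes, in Python: HTML goes in, the list of resources it would request comes out. This covers only what the static markup asks for. It does not execute JavaScript, track DOM modifications, or report cookie changes, which is exactly the missing part being discussed. The names `ResourceCollector` and `requested_resources` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that causes a fetch when the tag is loaded.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "link": "href",
}


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of resources referenced by static markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Relative URLs only make sense against the page's own URL.
                self.resources.append(urljoin(self.base_url, value))


def requested_resources(html, base_url):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.resources


page = (
    '<html><head><script src="/js/app.js"></script></head>'
    '<body><img src="pics/a.png"></body></html>'
)
print(requested_resources(page, "http://example.com/forum/thread/1"))
```

Note that even this static version needs a base URL to resolve relative links, so a text blob alone is not enough input.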
03:48
🔗
|
db48x |
chfoo: interaction is a big part of it though |
03:48
🔗
|
yuri_seva |
Yeah it's a bit of a neat thing... you can kind of get away with loading html/javascript from a text blob, but it still forces you to create a context (Sort of saying... "where the page came from, what's it relative to, etc") |
03:48
🔗
|
yuri_seva |
Once you have that, you can access it just like a user, but in code |
03:49
🔗
|
db48x |
https://github.com/yuri-sevatz/libfacebook/blob/master/libfacebook/clientprivate.cpp#L97-100 |
03:49
🔗
|
yuri_seva |
Exactly! -- and the javascript execution after, things like "this.click()" |
03:50
🔗
|
yuri_seva |
That lets you easily take on complex companies like google and facebook |
03:50
🔗
|
yuri_seva |
Now... I'm trying to find a way to wrap this (either with a library, or CLI) -- to make maintaining web template interactions easier |
03:54
🔗
|
yuri_seva |
So that when you say "give me this thread", you can get back a formatted set/dictionary of data, sort of specific to a model you define (on a per-site basis) |
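One way to picture the per-site model idea is a small Python sketch: a dictionary maps field names to where they live in the markup, and the extractor returns a dictionary of results. Everything here is hypothetical (`FORUM_MODEL`, `ModelExtractor`, the class names in the sample HTML), and it works on static HTML only, whereas the tool under discussion would run against a live browser DOM.

```python
from html.parser import HTMLParser

# Hypothetical per-site model: field name -> (tag, CSS class) holding it.
FORUM_MODEL = {
    "author": ("span", "author"),
    "body": ("div", "post-body"),
}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ModelExtractor(HTMLParser):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.records = {field: [] for field in model}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.model.items():
            if tag == want_tag and want_class in classes:
                self._field, self._depth, self._text = field, 1, []
                return

    def handle_endtag(self, tag):
        if self._field is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self.records[self._field].append("".join(self._text).strip())
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)


def extract(html, model):
    extractor = ModelExtractor(model)
    extractor.feed(html)
    extractor.close()
    return extractor.records


thread = (
    '<div class="post"><span class="author">alice</span>'
    '<div class="post-body">Hello <b>world</b><br>again</div></div>'
)
print(extract(thread, FORUM_MODEL))
```

When a site changes its templates, only the model dictionary needs updating, which is the maintenance benefit being described.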
03:54
🔗
|
* |
db48x nods |
03:55
🔗
|
db48x |
we also like to capture the http requests themselves in a WARC |
03:55
🔗
|
db48x |
since that lets us reproduce the exact site |
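To make the WARC point concrete, here is a minimal Python sketch that wraps one captured HTTP response in a WARC/1.0 `response` record. It is an illustration of the record layout only, assuming the raw response bytes are already in hand; the function name is hypothetical, and real captures would normally come from an existing tool rather than hand-written code like this.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_response_bytes):
    """Serialize one captured HTTP response as a WARC/1.0 'response' record."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Length of the block: the full HTTP response, status line included.
        ("Content-Length", str(len(http_response_bytes))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record is: header lines, blank line, block, then two CRLFs.
    return head.encode("utf-8") + http_response_bytes + b"\r\n\r\n"


raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nhi"
record = warc_response_record("http://example.com/", raw)
print(record.decode("utf-8"))
```

Because the block stores the response exactly as it came off the wire, headers and all, the original exchange can be replayed later.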
03:55
🔗
|
yuri_seva |
QNetworkAccessManager -- each browser instance comes with one |
03:56
🔗
|
yuri_seva |
you can MITM or listen to everything in and out |
03:56
🔗
|
yuri_seva |
I use similar tricks to find out how busy the AJAX is ;) |
03:57
🔗
|
yuri_seva |
There's other cool features too... recently I just found out how to turn on/off media loading |
03:58
🔗
|
yuri_seva |
so I can go "save bandwidth", or "stealth" to reduce the digital fingerprint, lol |
04:00
🔗
|
yuri_seva |
Oh... and you'll like this |
04:00
🔗
|
yuri_seva |
I did some javascript injection too! |
04:00
🔗
|
yuri_seva |
To modify complex pages while they're running, and hook javascript events |
04:01
🔗
|
yuri_seva |
Then get the results back in c++ ... very cool stuff |
04:02
🔗
|
yuri_seva |
I'll dig around more, need to get up to speed on what you have/need... and we'll try and get something fun for the toolkit |
04:02
🔗
|
yuri_seva |
This looks like a good place to start |
04:02
🔗
|
yuri_seva |
http://www.archiveteam.org/index.php?title=Dev/Source_Code |
04:02
🔗
|
yuri_seva |
:) |
04:04
🔗
|
yuri_seva |
Btw, anybody here going to defcon this year? |
04:13
🔗
|
yuri_seva |
(In Vegas!) |
04:16
🔗
|
db48x |
probably |
04:16
🔗
|
db48x |
not I though |
04:16
🔗
|
db48x |
and yep, that's the right page |
04:26
🔗
|
db48x |
I think we would probably use a proxy server to create the warc |
04:27
🔗
|
db48x |
mostly because it already exists, but also because it means it's easier to share across projects |
04:28
🔗
|
db48x |
on the other hand it means we'd want to be able to configure the proxy your program uses at runtime :) |
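A small Python sketch of what "configure the proxy at runtime" could look like for a fetching tool: a `--proxy` flag wins, then the `http_proxy` environment variable, otherwise no proxy. The flag name and the helper names are assumptions for illustration; the Qt program under discussion would use its own mechanism.

```python
import argparse
import os
import urllib.request


def resolve_proxy(argv, environ):
    """Pick the proxy URL: command-line flag first, then environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--proxy", default=None, help="e.g. http://127.0.0.1:8000")
    args = parser.parse_args(argv)
    return args.proxy or environ.get("http_proxy")


def make_opener(proxy_url):
    """Build a urllib opener that routes traffic through proxy_url, if given."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    else:
        # An empty dict turns off proxy autodetection entirely.
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)


proxy = resolve_proxy(["--proxy", "http://127.0.0.1:8000"], os.environ)
opener = make_opener(proxy)
print("using proxy:", proxy)
```

With this shape, the same program can be pointed at a WARC-writing proxy in one project and run directly in another, without rebuilding it.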
04:29
🔗
|
db48x |
(although lxc containers would eliminate even that...) |
04:30
🔗
|
yuri_seva |
Hmm, that's a neat idea. I was looking at the callbacks in the wget lua api, I can see why this helps you create the warc file |
04:31
🔗
|
yuri_seva |
https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks#callback-download_child_p |
04:31
🔗
|
yuri_seva |
But I suspect the API Qt gives us hands out enough information to see all the requests |
04:32
🔗
|
db48x |
yes, I'm sure it would |
04:34
🔗
|
yuri_seva |
It might be useful initially if we get it to be more of a complement to the wget-lua toolkit |
04:35
🔗
|
yuri_seva |
In my experience a headless browser is well-suited for DOM interaction, but there's still a lot of lifting to do if you want to, say, censor network access, or parse certain domains only |
04:36
🔗
|
yuri_seva |
because that has a lot of implications on the browser stack |
04:36
🔗
|
yuri_seva |
(Like, simple example, if you censor jquery... that's a big impact!) |
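The "certain domains only" idea can be sketched as a simple allowlist check in Python. This is an assumption about how such a filter might look, not how any browser stack implements it; `is_allowed` and the example hosts are hypothetical. The jQuery remark shows up directly: the host serving a library the page depends on has to be on the list, or the page breaks.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# The site itself plus the host its scripts are loaded from.
allowed = {"example.com", "code.jquery.com"}

for url in [
    "http://www.example.com/thread/1",
    "https://code.jquery.com/jquery.js",
    "http://ads.tracker.test/pixel.js",
]:
    print(url, "->", "fetch" if is_allowed(url, allowed) else "block")
```

Matching on a leading dot avoids treating `notexample.com` as part of `example.com`.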
04:38
🔗
|
yuri_seva |
They sort of made it to be simple, but, with that comes the overhead of driving a lot of ajax and media. |
04:39
🔗
|
yuri_seva |
Perfect metaphor: a LARGE boat =P |
04:41
🔗
|
yuri_seva |
Maybe you'd do something like, loading a complex page, then handing the DOM elements of interest to a database, or to wget-lua for the actual saving |
04:41
🔗
|
yuri_seva |
^^ Say, trying to mirror youtube :) |
04:41
🔗
|
Coderjoe |
wtf |
04:41
🔗
|
Coderjoe |
two processors for a single drive? |
04:41
🔗
|
Coderjoe |
er |
04:41
🔗
|
Coderjoe |
channel |
04:42
🔗
|
yuri_seva |
Well, kind of a pipeline, you'd load one into the browser stack to execute javascript if there's say, complex data you need to operate on |
04:42
🔗
|
yuri_seva |
And then put the results somewhere else |
04:43
🔗
|
yuri_seva |
(This is useful for a lot of sites that use MVC a lot) |
04:44
🔗
|
yuri_seva |
The actual saving in the browser stack isn't nearly as detailed... it's just high level so it's an easier ship to steer |
04:45
🔗
|
yuri_seva |
Your wget will always be more efficient, the stack is just good for more complex data extraction, or more complex UI interactions wherein you only care about learning something before saving |
04:46
🔗
|
yuri_seva |
^^ I might be wrong about the best way to use such a thing, but that's just from what I've sort of seen |
04:48
🔗
|
yuri_seva |
Obviously you'll never be able to share the connection between the "two processors", but, that's just because a high-level browser stack is quite a big boat to multitask :) |
04:48
🔗
|
yuri_seva |
You'd never have it pull the same performance as say, wget |
04:51
🔗
|
yuri_seva |
Good sites where this is useful... |
04:51
🔗
|
yuri_seva |
Imgur, photobucket... facebook... anybody that uses nasty amounts of javascript that's subject to change a lot |
04:53
🔗
|
yuri_seva |
Or, MEGA... because they have a lot of js! |
04:54
🔗
|
yuri_seva |
Anyway, gotta sleep, I'll think on this. Seems that there are a lot of rule sets to try and merge. Maybe possible one day! |
04:56
🔗
|
yuri_seva |
Thanks for entertaining the idea! |