#warrior 2013-10-28,Mon


Time Nickname Message
05:41 🔗 odie5533 didn't even know this channel existed
05:41 🔗 JRWR ya
05:41 🔗 joepie92 now you know :)
05:41 🔗 JRWR it exists
05:41 🔗 joepie92 but yeah
05:41 🔗 joepie92 as for security
05:41 🔗 joepie92 this is always going to be an issue to some degree
05:41 🔗 yipdw so, yeah, if you're proposing security improvements, that's great, but first, what's the threat model
05:41 🔗 joepie92 even if you lock down the API it can still be abused
05:41 🔗 ryonaloli well the current solution of a big clunky vm isn't a good idea in the first place
05:41 🔗 yipdw it works
05:42 🔗 yipdw what's the shortcomings?
05:42 🔗 odie5533 ryonaloli: why is that not a good idea?
05:42 🔗 odie5533 How big is the warrior? I've never used it
05:42 🔗 JRWR im thinking that would be awesome, a Lua sandbox with a nice slim API, cross send the data to more than one worker to make sure a worker is behaving, and auto-ban or throttle
05:42 🔗 yipdw I mean, you call it not a good idea, but then there's empirical data showing that it gets the job done
05:42 🔗 ryonaloli it's big, clunky, slow, and not even that secure
05:42 🔗 odie5533 well, then just release a lightweight version running a minimal debian install.
05:42 🔗 joepie92 (I also feel compelled to note that it is unlikely that any changes will be made, by anyone other than the person proposing it, as long as the current system works "well enough" - there's more urgent things to direct time/attention to for many here)
05:43 🔗 ryonaloli also, it gets in people's way
05:43 🔗 ryonaloli a big clunky window isn't nearly as nice as a tiny little service running in the background with an icon
05:43 🔗 yipdw so
05:43 🔗 yipdw ok
05:43 🔗 joepie92 ryonaloli: that's just an implementation detail
05:43 🔗 Cameron_D You can launch a virtualbox VM silently
05:43 🔗 yipdw I want to point out that one reason for the Warrior, and Seesaw, was actually not security
05:43 🔗 joepie92 virtualbox has an API
05:43 🔗 yipdw UNIFORM REPRESENTATION OF ARCHIVES is the big one
05:43 🔗 JRWR our install base would explode if we had a simple windows installer
05:43 🔗 odie5533 portability
05:43 🔗 ryonaloli what's the reason then?
05:43 🔗 joepie92 <odie5533>portability
05:43 🔗 joepie92 that
05:44 🔗 JRWR I dont want the warrior taking 3GB of ram :)
05:44 🔗 odie5533 JRWR: you could package the VM with a VirtualBox installer.
05:44 🔗 yipdw with Geocities and (to some extent) Google video
05:44 🔗 ryonaloli portability is only for usability, and as JRWR says, it doesn't make much difference if no one installs anyway
05:44 🔗 joepie92 ryonaloli: usability is exactly the point
05:44 🔗 yipdw there were problems with Windows machines and oddly-configured Unix-like things returning crap data
05:44 🔗 yipdw the Warrior VM is intended to fix that, and to a large degree it succeeded in that task
05:44 🔗 yipdw the reason why is not because it is ultra-secure
05:45 🔗 joepie92 it's just a simple works-almost-everywhere solution
05:45 🔗 yipdw it's because it was an easy way to get started for a lot more people
05:45 🔗 yipdw and a lot of people do not tinker with the VM
05:45 🔗 joepie92 without much maintenance required
05:45 🔗 ryonaloli even if security is no issue, it's a terrible workaround
05:45 🔗 yipdw now, if we start getting people who decide to fuck our projects, then we have a new threat
05:46 🔗 yipdw the aesthetics or security properties of the Warrior aside, I want people to understand *why* it exists
05:46 🔗 yipdw you cannot propose alternatives without understanding its historical context
05:46 🔗 JRWR oh and the current workers on the warrior dont work
05:46 🔗 JRWR only the short url one does at this moment
05:46 🔗 yipdw they work fine, they're just not tasks
05:46 🔗 yipdw tasked
05:47 🔗 ryonaloli if this becomes bigger, we could always make it redundant, right?
05:47 🔗 yipdw I also want to caution against roflscale dreams
05:47 🔗 yipdw most Archive Team projects do not require a thousand-soldier army
05:47 🔗 joepie92 yipdw: roflscale dreams?
05:47 🔗 yipdw most will never have them
05:47 🔗 yipdw joepie92: yeah
05:47 🔗 JRWR the "4chan" army
05:47 🔗 JRWR get 100k people running warriors
05:49 🔗 yipdw so, there are claims of the warrior being big, clunky, slow, and not secure
05:49 🔗 ryonaloli there's always ways to get huge amounts of clients, even if only through advertising
05:49 🔗 yipdw I'd like to know of specifics
05:49 🔗 ryonaloli i mean, /g/ would *love* this kind of thing
05:49 🔗 yipdw and the best bang-for-time changes
05:49 🔗 ryonaloli specifics? it's the overhead of running an entire OS
05:50 🔗 joepie92 yipdw: one specific I have pointed out before (that is being sidestepped nowadays by running manually) is that running an entire virtualbox VM on a low-end VPS is absolutely not feasible
05:50 🔗 joepie92 but as I already said, that is not hard to solve
05:50 🔗 joepie92 just adapt warrior updating code to run outside the warrior
05:50 🔗 yipdw joepie92: yeah, but so far the workaround for that is "run the seesaw scripts directly"
05:50 🔗 joepie92 yes, I know
05:50 🔗 yipdw and that seems to be doing well for the current audience
05:50 🔗 JRWR that sounds messy
05:50 🔗 yipdw it really isn't
05:50 🔗 joepie92 yipdw: I really would prefer automated updating; now I need to SSH into xx boxes every time a new project appears
05:50 🔗 joepie92 I'd like that to be automatic
05:51 🔗 joepie92 but it's *doable* now
05:51 🔗 yipdw you could extract that from the Warrior, run that as a Seesaw task
05:51 🔗 joepie92 yipdw: I was around when we just started doing this manual running thing of pipelines; it has much improved since then :)
05:51 🔗 ryonaloli how complicated can it be to make a website grabber that gets instructions from a c&c... i mean, you could just rewrite it all in scratch, no?
05:51 🔗 ryonaloli *from scratch
05:51 🔗 joepie92 ryonaloli: there is absolutely no need to rewrite anything from scratch
05:51 🔗 joepie92 there is already a solid framework
05:52 🔗 yipdw ryonaloli: it's not that hard, it's just that the payoff so far isn't there
05:52 🔗 joepie92 also, JRWR, see http://github.com/joepie91/isohunt-grab
05:52 🔗 joepie92 that is the entirety of manual setup instructions
05:52 🔗 joepie92 virtually every seesaw project will have identical instructions
05:52 🔗 joepie92 most of it is set up once, run many times
05:52 🔗 ryonaloli having separate scripts for every project is just nauseating imo...
05:52 🔗 Cameron_D Why, every website is different
05:52 🔗 ryonaloli i mean, sure it technically works, but there's the KISS principle to remember
05:52 🔗 yipdw please look at the scripts
05:52 🔗 joepie92 ryonaloli: that is going to be the case anyway
05:52 🔗 Cameron_D Need to handle them differently
05:52 🔗 joepie92 ryonaloli: you have to understand that these aren't separate -applications-
05:53 🔗 joepie92 they just define custom behaviour within the seesaw framework
05:53 🔗 yipdw ryonaloli: this is KISS
05:53 🔗 joepie92 the warrior uses these exact same scripts - it just auto-downloads them
05:53 🔗 joepie92 that's it really
05:53 🔗 yipdw a highly configurable framework and a strict task schema is falling on the other side of KISS
05:53 🔗 joepie92 why am I a 92?
05:54 🔗 yipdw I mean, there's a very strong demonstration that this is KISS
05:54 🔗 yipdw a grabber can be written, tested, and deployed in a couple of days, or less
05:54 🔗 joepie91 ryonaloli: read through this: https://github.com/ArchiveTeam/blip.tv-grab-video-only/blob/master/pipeline.py
05:54 🔗 joepie91 yipdw: uh, one day
05:54 🔗 joepie91 for isohunt-grab
05:54 🔗 yipdw ok
05:54 🔗 ryonaloli can't the tracker send the scripts instead of downloading manually?
05:54 🔗 ryonaloli i mean, that'd make sense
05:54 🔗 joepie91 and isohunt-grab is fairly complicated
05:55 🔗 joepie91 ryonaloli; the tracker already tells the warriors where to find the scripts
05:55 🔗 joepie91 this already exists
05:55 🔗 joepie91 it just doesn't run outside the warrior VM yet
05:55 🔗 ryonaloli yeah i know, but it's on a vm
05:55 🔗 joepie91 so make it not be in a VM!
05:55 🔗 ryonaloli i can't into windows ;_;
05:55 🔗 joepie91 the infrastructure is already there
05:55 🔗 joepie91 exactly
05:55 🔗 joepie91 so our problem isn't code delivery
05:55 🔗 joepie91 our problem is platform compatibility
05:56 🔗 Cameron_D which is what the VM solves
05:56 🔗 joepie91 and the compatibility issues aren't (exclusively) caused by the code
05:56 🔗 joepie91 but primarily by inherent platform restrictions
05:56 🔗 joepie91 filesystems etc.
05:56 🔗 ryonaloli can't it be stored in a database of some kind?
05:56 🔗 joepie91 Cameron_D: for certain values of "solves", yes
05:57 🔗 joepie91 ryonaloli: can't what be stored in a database?
05:57 🔗 ryonaloli well, you said there's issues with filesystem compatibility, and in the other channel i heard something about windows not storing filenames correctly in NTFS
05:58 🔗 joepie91 ryonaloli: databases aren't magic; in a situation like this, if anything, it would be a bottleneck
05:58 🔗 joepie91 (depending on your definition of 'database' - you could technically consider a filesystem to be a database)
05:58 🔗 yipdw ryonaloli: I guess you could use sqlite or something, and distribute it; but what would be the benefit over telling people "get a VM or run it on these platforms that we know it works on"
05:58 🔗 Cameron_D Well, to an extent the warc+cdx is a database of sorts, but the file still needs to be downloaded and stored first
05:59 🔗 yipdw we are not going for infinite scalability, and we so far do not need it
05:59 🔗 yipdw no site has been able to actually exhaust Warrior capacity
05:59 🔗 ryonaloli yipdw: because it's overly complicated, and keeps the average altruistic user from doing it
05:59 🔗 ryonaloli that's one reason
05:59 🔗 odie5533 A lightweight alternative client might be useful.
05:59 🔗 odie5533 Especially for e.g. deploying on a remote server
05:59 🔗 joepie91 also, ryonaloli, I think that first of all you should become a bit more familiar with the internals of seesaw and the warrior
05:59 🔗 joepie91 because I feel like you misunderstand how it works
05:59 🔗 joepie91 on a technical level
05:59 🔗 yipdw ryonaloli: "overly complicated" is a bit of a difficult metric, but ok
05:59 🔗 Cameron_D odie5533: that already exists in the form of the seesaw script
06:00 🔗 odie5533 Cameron_D: then I guess we have everything!
06:00 🔗 Cameron_D https://github.com/joepie91/isohunt-grab#running-without-a-warrior
06:00 🔗 odie5533 maybe some refactoring, perhaps a lightweight Windows native script, but the major pieces are there.
06:01 🔗 odie5533 and I use the term natively only somewhat loosely: could run in the JVM or something.
06:01 🔗 yipdw hmm
06:01 🔗 joepie91 again, the problem here is the lack of auto-updating really
06:01 🔗 joepie91 (re: seesaw)
06:01 🔗 joepie91 but that is not a hard problem to solve
06:02 🔗 joepie91 just requires work
06:02 🔗 odie5533 What do you mean auto-updating?
06:02 🔗 yipdw the warrior updates its scripts on a periodic basis
06:02 🔗 yipdw that's a shell script
06:02 🔗 joepie91 odie5533; automatically retrieving new versions of the pipeline code used to grab websites, and retrieving code for new projects
06:02 🔗 joepie91 like the warrior does
06:02 🔗 yipdw you could just take that and run it as a separate process in a seesaw project
06:03 🔗 joepie91 I kind of need to continue doing the dishes
06:03 🔗 joepie91 if there's a question specific to me, prefix with my name so I won't miss it when I come back
06:29 🔗 joepie91 for reference, what we need most in my opinion for warrior/seesaw, in this order: complete seesaw documentation, a non-VM Warrior-like wrapper for Linux systems, a spot checking mechanism against potential bad clients, and a pure Python replacement for wget
06:29 🔗 joepie91 (going back to dishes, just figured I'd write that down here before I forgot)
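A minimal sketch of the spot-checking idea raised above (sending the same item to more than one worker and flagging the odd one out). Nothing like this is described as existing in seesaw or the tracker; the function name `spot_check`, the worker names, and the use of a content digest as the comparison key are all assumptions made for illustration. Pages that legitimately differ between fetches would need a looser comparison than an exact digest match.

```python
from collections import Counter


def spot_check(digests_by_worker):
    """Return (majority_digest, suspects) for one work item.

    digests_by_worker maps a (hypothetical) worker name to the digest
    that worker reported. Workers that disagree with a strict majority
    are returned as suspects. Without a strict majority nobody is
    flagged and the digest is None.
    """
    if not digests_by_worker:
        return None, []
    counts = Counter(digests_by_worker.values())
    digest, votes = counts.most_common(1)[0]
    # Require more than half of the workers to agree before blaming anyone.
    if votes * 2 <= len(digests_by_worker):
        return None, []
    suspects = sorted(w for w, d in digests_by_worker.items() if d != digest)
    return digest, suspects
```

A tracker could feed the suspects list into whatever throttle or ban policy it prefers; that policy is deliberately left out here.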
06:32 🔗 joepie91 actually fuck dishes, I have code to write
06:32 🔗 joepie91 and a server to upgrade/migrate
06:33 🔗 joepie91 and... /me eye-scrolls down todo list
06:33 🔗 odie5533 joepie91: I wrote, somewhat, a python replacement for wget
06:33 🔗 odie5533 https://github.com/odie5533/WarcMiddleware
06:34 🔗 joepie91 odie5533: is it a library? or a tool?
06:34 🔗 odie5533 both!
06:34 🔗 odie5533 you can basically just write a Scrapy script to mirror a site, and WarcMiddleware handles the Warc-saving part of it
06:35 🔗 odie5533 Or you can use the included script, which has a CLI
06:35 🔗 odie5533 so you can either write your own scripts, or use command line parameters. Writing your own is more powerful, but the command line params work for most stuff
06:35 🔗 joepie91 mm.
06:36 🔗 joepie91 odie5533: if you throw horribly broken HTML at Scrapy, what does it do?
06:36 🔗 joepie91 :P
06:36 🔗 joepie91 and/or what does it not do
06:36 🔗 odie5533 It uses lxml, which is a very powerful parsing library
06:36 🔗 odie5533 And if it doesn't parse right, you can just modify the parser relatively easily
06:36 🔗 joepie91 afaik lxml stumbles pretty easily
06:36 🔗 joepie91 over non-well-formatted documents
06:36 🔗 odie5533 here's the parser: https://github.com/odie5533/WarcMiddleware/blob/master/crawltest/spiders/simplespider.py
06:37 🔗 joepie91 unless it speaks beautifulsoup nowadays
06:37 🔗 odie5533 it does
06:37 🔗 odie5533 and has for years
06:37 🔗 * joepie91 raises eyebrow
06:37 🔗 odie5533 but the default parser seems to work quite well. And regardless, you can use any parser you want by just modifying the simplespider.py
06:37 🔗 joepie91 right
06:38 🔗 joepie91 what is resource usage of scrapy like?
06:38 🔗 odie5533 I don't really know
06:38 🔗 odie5533 It uses Twisted and Python, so I assume not bad.
06:38 🔗 joepie91 ... twisted
06:38 🔗 joepie91 :|
06:38 🔗 odie5533 eh?
06:38 🔗 odie5533 I love Twisted.
06:38 🔗 * joepie91 has very bad experiences with Twisted on OpenVZ
06:38 🔗 joepie91 leaks like a motherfucker
06:38 🔗 joepie91 also, I despite Twisted
06:39 🔗 joepie91 can you use Scrapy without ever touching Twisted?
06:39 🔗 yipdw so, one problem that we have right now is that sites that have a few hundred thousand URLs consume large amounts of memory in wget
06:39 🔗 odie5533 joepie91: most people do.
06:39 🔗 joepie91 s/despite/despise/
06:39 🔗 yipdw I'd like to know the memory usage characteristics of scrapy on such a target
06:39 🔗 joepie91 seconding what yipdw said
06:39 🔗 odie5533 joepie91: internally, my WarcMiddleware hooks into Twisted. But you don't need to know about that. Just writing Scrapy scripts is basically not any Twisted
06:39 🔗 yipdw by "large", I mean around 200 MB
06:39 🔗 joepie91 odie5533: right, that would fall under 'acceptable' then
06:39 🔗 odie5533 yipdw: I don't think it solves that.
06:39 🔗 odie5533 but I am working on a solution right now.
06:39 🔗 yipdw so it's not THAT large, but it can be a problem
06:39 🔗 joepie91 though I'm still not happy about the size of Twisted as a dependency
06:39 🔗 yipdw it's a problem for ArchiveBot, at least
06:40 🔗 odie5533 yipdw: the URL strings are too big?
06:40 🔗 odie5533 joepie91: it's like 2 MB?
06:40 🔗 yipdw odie5533: no, it's just a lot of URLs
06:40 🔗 odie5533 yipdw: I don't understand.
06:40 🔗 yipdw odie5533: wget keeps track of what it's seen during a recursive retrieval; the amount of storage required is directly proportional to the number of URLs it's seen
06:40 🔗 odie5533 Are the URL strings too big to hold in memory, or does it just trip up if you give it too many URL strings?
06:41 🔗 yipdw er, it has
06:41 🔗 joepie91 odie5533: if I am not mistaken, it pulls in quite a lot of dependencies
06:41 🔗 odie5533 joepie91: You are mistaken
06:41 🔗 odie5533 it uses Zope Interface and OpenSSL. that's it
06:41 🔗 yipdw odie5533: the amount of memory wget uses for each new URL is small, but it does add up; and once you get closer to a million URLs, the memory required becomes problematic for some applications
06:42 🔗 yipdw I'm mentioning this only because this is an ArchiveBot issue
06:42 🔗 yipdw it is emphatically NOT a problem for most warrior applications
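A hedged illustration of one way around the memory growth yipdw describes: keep the "already seen" set in SQLite rather than in process memory. This is not how wget tracks URLs; it is a swapped-in technique, and the class name `SeenUrls` is hypothetical. The default `:memory:` path is only convenient for trying it out; passing a file path is what would actually move the set to disk.

```python
import sqlite3


class SeenUrls:
    """Set-like record of visited URLs stored in an SQLite table."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep the set on disk.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)"
        )

    def add(self, url):
        """Record url; return True if it was new, False if already seen."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,)
        )
        self.db.commit()
        return cur.rowcount == 1

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
```

Committing on every insert keeps the sketch simple but would be slow at a million URLs; batching commits is the obvious refinement.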
06:42 🔗 odie5533 yipdw: I assume Scrapy could handle it fine, though I'm not sure my WarcMiddleware is up to the task, but it might be
06:42 🔗 yipdw ok
06:42 🔗 yipdw I'll take a look at it
06:43 🔗 odie5533 yipdw: the --url-file command inputs a list of urls
06:44 🔗 odie5533 If you are going to try WarcMiddleware, please read the INSTALL.md file. The installation is slightly complicated because you need to use old versions of Twisted and Scrapy.
06:44 🔗 yipdw odie5533: wget has something similar, but read-from-file mode is a bit different
06:44 🔗 yipdw at least operationally; they could very well use the same code paths
06:46 🔗 odie5533 yipdw: for archivebot, you could write a WgetManager that only sends it a partial list of urls
06:46 🔗 yipdw odie5533: streaming in a list is possible, but complicates handling pipeline failure
06:46 🔗 joepie91 odie5533: can you update it to work with newest Twisted and Scrapy..?
06:46 🔗 yipdw at least in seesaw
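A small sketch of the "partial list of urls" idea: a generator that hands out URLs in fixed-size batches so a fetcher never holds the whole list at once. The WgetManager odie5533 mentions is only a suggestion in this conversation; `url_batches` and its parameters are hypothetical names. It does not address yipdw's concern about pipeline failure: a caller would still need to record which batches completed.

```python
from itertools import islice


def url_batches(urls, size):
    """Yield lists of at most `size` URLs from any iterable, lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Because the input can be any iterable, `urls` could be an open file of one URL per line, so the full list never has to be loaded.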
06:46 🔗 * joepie91 hates unstable deps
06:46 🔗 odie5533 joepie91: Afraid not... =/
06:46 🔗 odie5533 I don't like them either
06:46 🔗 joepie91 how so?
06:47 🔗 odie5533 but Twisted made major internal changes to support HTTP 1.1, and in those changes it makes it much more difficult to hook into.
06:48 🔗 joepie91 urgh
06:48 🔗 odie5533 so I could either force HTTP1.0, or, the method I chose, is to use a proxy.
06:48 🔗 * joepie91 has no plans of moving Twisted out of his "dislike" list any time soon
06:48 🔗 odie5533 then I don't need to use hacks to hook into Scrapy/Twisted.
06:48 🔗 odie5533 I wrote a couple proxies actually :D
06:48 🔗 joepie91 odie5533: proxy in what sense?
06:48 🔗 odie5533 web proxy
06:49 🔗 joepie91 wouldn't that fuck the headers?
06:49 🔗 odie5533 not really
06:49 🔗 odie5533 it can still record the exact headers sent to and from the server
06:49 🔗 yipdw odie5533: on a completely unrelated note, I think a PhantomJS controller is something else that'd be useful, especially given that many sites no longer work by just freeze-drying resources
06:50 🔗 yipdw I did write a PhantomJS + WarcProxy pipeline once, but it was unstable as hell
06:50 🔗 odie5533 my WarcProxy?
06:50 🔗 yipdw yes
06:50 🔗 odie5533 yipdw: a better warc proxy would fix it I think
06:50 🔗 yipdw WarcProxy was fine
06:50 🔗 yipdw the pipeline control was the problem
06:50 🔗 odie5533 well, it doesn't do SSL
06:50 🔗 odie5533 the new one does SSL: https://github.com/odie5533/WarcMITMProxy
06:50 🔗 yipdw oh, that was irrelevant for the purpose of this project
06:51 🔗 yipdw I suspect WarcMITMProxy could be substituted and the problems would remain
06:51 🔗 odie5533 well, it uses a completely different library
06:51 🔗 yipdw it wasn't WarcProxy that was the issue; it was controlling the pipeline
06:51 🔗 odie5533 so it might solve the problems. I don't know though
06:51 🔗 odie5533 ah
06:51 🔗 odie5533 is phantomjs like ghostjs?
06:51 🔗 yipdw the setup was a bit of a mess of poltergeist + capybara + PhantomJS + WarcProxy
06:51 🔗 yipdw it worked, but required a lot of babysitting
06:52 🔗 yipdw I think something that could plug into Seesaw as an alternative fetcher would be very useful for making the Warrior more useful for Javascript monstrosities
06:52 🔗 odie5533 well that's sort of why the proxy is a great thing since you just need to write the scraping stuff and let the proxy handle the Warc details
06:52 🔗 yipdw odie5533: phantomjs is headless Webkit
06:52 🔗 yipdw made Javascript-scriptable
06:52 🔗 joepie91 oh! this reminds me
06:53 🔗 odie5533 I was thinking of Ghost.py, not ghost.js (blog)
06:53 🔗 joepie91 yipdw: the earlier realize thing that was used for puush and isohunt should probably be a part of seesaw
06:53 🔗 odie5533 Ghost.py I think is not headless. It's a full client
06:53 🔗 joepie91 so it can be used for interpolating ranges
06:53 🔗 odie5533 yipdw: using a headless webkit is going to be really beefy
06:53 🔗 yipdw joepie91: I don't have release access on seesaw
06:54 🔗 joepie91 yipdw: just making a remark :)
06:54 🔗 yipdw odie5533: yeah, but beefiness is not really my concern -- a good archive is
06:54 🔗 odie5533 ah
06:54 🔗 joepie91 so that it doesn't get lost in my forgetfulness
06:54 🔗 yipdw and I'd rather that be done in the simplest way
06:54 🔗 joepie91 also, yipdw, define "release access"
06:54 🔗 yipdw joepie91: I can't push new Seesaw versions
06:54 🔗 odie5533 did the freeze thing not work?
06:54 🔗 joepie91 on? github? pypi?
06:54 🔗 yipdw only alard can do that, AFAIK
06:54 🔗 joepie91 because I maintain the pypi package
06:54 🔗 yipdw oh
06:54 🔗 yipdw well, ok
06:55 🔗 odie5533 yipdw: https://github.com/iramari/flashfreeze
06:55 🔗 joepie91 yipdw: alard just bothers me every time it needs to be updated :)
06:55 🔗 joepie91 (which is fine with me)
06:55 🔗 yipdw odie5533: I haven't tried it
06:55 🔗 yipdw the main reason I used PhantomJS is that I was familiar with its capabilities
06:56 🔗 yipdw and I knew that it would get me a good archive
06:56 🔗 yipdw flashfreeze is something else to try
06:56 🔗 odie5533 I did a comparison of them on my WarcProxy page: https://github.com/odie5533/WarcProxy
06:56 🔗 odie5533 it works pretty well, though it misses a little bit
06:57 🔗 yipdw I'm curious to see how that works on e.g. Discourse forums
06:57 🔗 odie5533 should work
06:57 🔗 odie5533 it runs a full Qt WebKit
06:57 🔗 yipdw ok
06:58 🔗 yipdw I'll give that a shot in ArchiveBot
06:58 🔗 yipdw (someday)
06:58 🔗 odie5533 I think FlashFreeze doesn't output warc though
06:58 🔗 yipdw oh
06:58 🔗 yipdw hm
06:59 🔗 odie5533 it outputs URLs and then sends them to wget I think...
06:59 🔗 yipdw wget outputs WARC, I guess that's enough
06:59 🔗 odie5533 so it outputs all the asset urls
06:59 🔗 odie5533 it's sort of enough. If there are sessions or cookies or something then wget might not be able to grab them
07:00 🔗 yipdw ah, right
07:00 🔗 odie5533 also uses 2x bandwidth
