[05:41] didn't even know this channel existed
[05:41] ya
[05:41] now you know :)
[05:41] it exists
[05:41] but yeah
[05:41] as for security
[05:41] this is always going to be an issue to some degree
[05:41] so, yeah, if you're proposing security improvements, that's great, but first, what's the threat model
[05:41] even if you lock down the API it can still be abused
[05:41] well the current solution of a big clunky vm isn't a good idea in the first place
[05:41] it works
[05:42] what are the shortcomings?
[05:42] ryonaloli: why is that not a good idea?
[05:42] How big is the warrior? I've never used it
[05:42] I'm thinking that would be awesome, a LUA sandbox with a nice slim API, cross-send the data to more than one worker to make sure a worker is behaving, and auto-ban or throttle
[05:42] I mean, you call it not a good idea, but then there's empirical data showing that it gets the job done
[05:42] it's big, clunky, slow, and not even that secure
[05:42] well, then just release a lightweight version running a minimal debian install.
[05:42] (I also feel compelled to note that it is unlikely that any changes will be made, by anyone other than the person proposing it, as long as the current system works "well enough" - there are more urgent things to direct time/attention to for many here)
[05:43] also, it gets in people's way
[05:43] a big clunky window isn't nearly as nice as a tiny little service running in the background with an icon
[05:43] so
[05:43] ok
[05:43] ryonaloli: that's just an implementation detail
[05:43] You can launch a virtualbox VM silently
[05:43] I want to point out that one reason for the Warrior, and Seesaw, was actually not security
[05:43] virtualbox has an API
[05:43] UNIFORM REPRESENTATION OF ARCHIVES is the big one
[05:43] our install base would explode if we had a simple windows installer
[05:43] portability
[05:43] what's the reason then?
[05:43] portability
[05:43] that
[05:44] I don't want the warrior taking 3GB of RAM :)
[05:44] JRWR: you could package the VM with a VirtualBox installer.
[05:44] with Geocities and (to some extent) Google Video
[05:44] portability is only for usability, and as JRWR says, it doesn't make much difference if no one installs anyway
[05:44] ryonaloli: usability is exactly the point
[05:44] there were problems with Windows machines and oddly-configured Unix-like things returning crap data
[05:44] the Warrior VM is intended to fix that, and to a large degree it succeeded in that task
[05:44] the reason why is not because it is ultra-secure
[05:45] it's just a simple works-almost-everywhere solution
[05:45] it's because it was an easy way to get started for a lot more people
[05:45] and a lot of people do not tinker with the VM
[05:45] without much maintenance required
[05:45] even if security is no issue, it's a terrible workaround
[05:45] now, if we start getting people who decide to fuck our projects, then we have a new threat
[05:46] the aesthetics or security properties of the Warrior aside, I want people to understand *why* it exists
[05:46] you cannot propose alternatives without understanding its historical context
[05:46] oh and the current workers on the warrior don't work
[05:46] only the short url one does at this moment
[05:46] they work fine, they're just not tasks
[05:46] tasked
[05:47] if this becomes bigger, we could always make it redundant, right?
[05:47] I also want to caution against roflscale dreams
[05:47] most Archive Team projects do not require a thousand-soldier army
[05:47] yipdw: roflscale dreams?
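(A purely illustrative sketch of the cross-checking idea floated above: hand the same item to two warriors and compare what comes back before trusting either. No such mechanism exists in the tracker today; every name below is invented.)

    import hashlib
    import random

    def assign_redundantly(item, workers):
        # give the same item to two distinct workers so their
        # results can be compared before either is trusted
        return random.sample(workers, 2)

    def results_agree(results):
        # results: {worker_id: payload_bytes} for one item
        # NB: whole-WARC digests would never match (record timestamps
        # differ), so a real check would compare per-URL payload
        # digests instead; disagreement triggers the ban/throttle
        digests = set(hashlib.sha1(data).hexdigest()
                      for data in results.values())
        return len(digests) == 1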
[05:47] most will never have them
[05:47] joepie92: yeah
[05:47] the "4chan" army
[05:47] get 100k people running warriors
[05:49] so, there are claims of the warrior being big, clunky, slow, and not secure
[05:49] there are always ways to get huge numbers of clients, even if only through advertising
[05:49] I'd like to know of specifics
[05:49] i mean, /g/ would *love* this kind of thing
[05:49] and the best bang-for-time changes
[05:49] specifics? it's the overhead of running an entire OS
[05:50] yipdw: one specific I have pointed out before (that is being sidestepped nowadays by running manually) is that running an entire virtualbox VM on a low-end VPS is absolutely not feasible
[05:50] but as I already said, that is not hard to solve
[05:50] just adapt warrior updating code to run outside the warrior
[05:50] joepie92: yeah, but so far the workaround for that is "run the seesaw scripts directly"
[05:50] yes, I know
[05:50] and that seems to be doing well for the current audience
[05:50] that sounds messy
[05:50] it really isn't
[05:50] yipdw: I really would prefer automated updating; now I need to SSH into xx boxes every time a new project appears
[05:50] I'd like that to be automatic
[05:51] but it's *doable* now
[05:51] you could extract that from the Warrior, run that as a Seesaw task
[05:51] yipdw: I was around when we just started doing this manual running thing of pipelines; it has much improved since then :)
[05:51] how complicated can it be to make a website grabber that gets instructions from a c&c... i mean, you could just rewrite it all in scratch, no?
[05:51] *from scratch
[05:51] ryonaloli: there is absolutely no need to rewrite anything from scratch
[05:51] there is already a solid framework
[05:52] ryonaloli: it's not that hard, it's just that the payoff so far isn't there
[05:52] also, JRWR, see http://github.com/joepie91/isohunt-grab
[05:52] that is the entirety of manual setup instructions
[05:52] virtually every seesaw project will have identical instructions
[05:52] most of it is set up once, run many times
[05:52] having separate scripts for every project is just nauseating imo...
[05:52] Why, every website is different
[05:52] i mean, sure it technically works, but there's the KISS principle to remember
[05:52] please look at the scripts
[05:52] ryonaloli: that is going to be the case anyway
[05:52] Need to handle them differently
[05:52] ryonaloli: you have to understand that these aren't separate -applications-
[05:53] they just define custom behaviour within the seesaw framework
[05:53] ryonaloli: this is KISS
[05:53] the warrior uses these exact same scripts - it just auto-downloads them
[05:53] that's it really
[05:53] a highly configurable framework and a strict task schema is falling on the other side of KISS
[05:53] why am I a 92?
[05:54] I mean, there's a very strong demonstration that this is KISS
[05:54] a grabber can be written, tested, and deployed in a couple of days, or less
[05:54] ryonaloli: read through this: https://github.com/ArchiveTeam/blip.tv-grab-video-only/blob/master/pipeline.py
[05:54] yipdw: uh, one day
[05:54] for isohunt-grab
[05:54] ok
[05:54] can't the tracker send the scripts instead of downloading manually?
[05:54] i mean, that'd make sense
[05:54] and isohunt-grab is fairly complicated
[05:55] ryonaloli: the tracker already tells the warriors where to find the scripts
[05:55] this already exists
[05:55] it just doesn't run outside the warrior VM yet
[05:55] yeah i know, but it's on a vm
[05:55] so make it not be in a VM!
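(For reference, a seesaw pipeline.py has roughly this shape. This is a sketch paraphrased from memory of ArchiveTeam grab repos rather than copied from one, so exact signatures may differ; the tracker URL and project name are placeholders, and 'downloader' is a name injected into the file by the seesaw runner.)

    import os

    from seesaw.externalprocess import WgetDownload
    from seesaw.item import ItemInterpolation
    from seesaw.pipeline import Pipeline
    from seesaw.project import Project
    from seesaw.task import SimpleTask
    from seesaw.tracker import GetItemFromTracker

    class PrepareDirectories(SimpleTask):
        # project-specific behaviour lives in small tasks like this
        def __init__(self):
            SimpleTask.__init__(self, "PrepareDirectories")

        def process(self, item):
            item["item_dir"] = "data/%s" % item["item_name"]
            os.makedirs(item["item_dir"])

    project = Project(title="example-grab")  # placeholder metadata

    # the pipeline is just a sequence of tasks; the warrior and a
    # manual run-pipeline invocation both execute this same object
    pipeline = Pipeline(
        GetItemFromTracker("http://tracker.example.org/example", downloader),
        PrepareDirectories(),
        WgetDownload(["wget", "-r",
                      "--warc-file",
                      ItemInterpolation("%(item_dir)s/%(item_name)s"),
                      ItemInterpolation("http://example.org/%(item_name)s")]),
        # upload and SendDoneToTracker steps omitted for brevity
    )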
[05:55] i can't into windows ;_;
[05:55] the infrastructure is already there
[05:55] exactly
[05:55] so our problem isn't code delivery
[05:55] our problem is platform compatibility
[05:56] which is what the VM solves
[05:56] and the compatibility issues aren't (exclusively) caused by the code
[05:56] but primarily by inherent platform restrictions
[05:56] filesystems etc.
[05:56] can't it be stored in a database of some kind?
[05:56] Cameron_D: for certain values of "solves", yes
[05:57] ryonaloli: can't what be stored in a database?
[05:57] well, you said there are issues with filesystem compatibility, and in the other channel i heard something about windows not storing filenames correctly in NTFS
[05:58] ryonaloli: databases aren't magic; in a situation like this, if anything, it would be a bottleneck
[05:58] (depending on your definition of 'database' - you could technically consider a filesystem to be a database)
[05:58] ryonaloli: I guess you could use sqlite or something, and distribute it; but what would be the benefit over telling people "get a VM or run it on these platforms that we know it works on"
[05:58] Well, to an extent the warc+cdx is a database of sorts, but the file still needs to be downloaded and stored first
[05:59] we are not going for infinite scalability, and we so far do not need it
[05:59] no site has been able to actually exhaust Warrior capacity
[05:59] yipdw: because it's overly complicated, and keeps the average altruistic user from doing it
[05:59] that's one reason
[05:59] A lightweight alternative client might be useful.
[05:59] Especially for e.g. deploying on a remote server
[05:59] also, ryonaloli, I think that first of all you should become a bit more familiar with the internals of seesaw and the warrior
[05:59] because I feel like you misunderstand how it works
[05:59] on a technical level
[05:59] ryonaloli: "overly complicated" is a bit of a difficult metric, but ok
[05:59] odie5533: that already exists in the form of the seesaw script
[06:00] Cameron_D: then I guess we have everything!
[06:00] https://github.com/joepie91/isohunt-grab#running-without-a-warrior
[06:00] maybe some refactoring, perhaps a lightweight Windows native script, but the major pieces are there.
[06:01] and I use the term natively only somewhat loosely: could run in the JVM or something.
[06:01] hmm
[06:01] again, the problem here is the lack of auto-updating really
[06:01] (re: seesaw)
[06:01] but that is not a hard problem to solve
[06:02] just requires work
[06:02] What do you mean auto-updating?
[06:02] the warrior updates its scripts on a periodic basis
[06:02] that's a shell script
[06:02] odie5533: automatically retrieving new versions of the pipeline code used to grab websites, and retrieving code for new projects
[06:02] like the warrior does
[06:02] you could just take that and run it as a separate process in a seesaw project
[06:03] I kind of need to continue doing the dishes
[06:03] if there's a question specific to me, prefix with my name so I won't miss it when I come back
[06:29] for reference, what we need most in my opinion for warrior/seesaw, in this order: complete seesaw documentation, a non-VM Warrior-like wrapper for Linux systems, a spot-checking mechanism against potential bad clients, and a pure Python replacement for wget
[06:29] (going back to dishes, just figured I'd write that down here before I forgot)
[06:32] actually fuck dishes, I have code to write
[06:32] and a server to upgrade/migrate
[06:33] and...
[06:33] * joepie91 eye-scrolls down todo list
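(The missing auto-update piece being discussed is small. Below is an illustrative sketch only, not the warrior's actual updater, which is a shell script; the repo path and interval here are made up.)

    import subprocess
    import time

    REPO = "/home/archiveteam/isohunt-grab"  # hypothetical project checkout
    CHECK_INTERVAL = 300                     # seconds between update checks

    def head(repo):
        # current commit of the pipeline code
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=repo).strip()

    while True:
        before = head(REPO)
        subprocess.call(["git", "pull", "--ff-only"], cwd=REPO)
        if head(REPO) != before:
            # a real wrapper would let the current item finish, then
            # restart run-pipeline so the new project code takes effect
            print("pipeline code updated; restart pending")
        time.sleep(CHECK_INTERVAL)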
[06:33] joepie91: I wrote, somewhat, a python replacement for wget
[06:33] https://github.com/odie5533/WarcMiddleware
[06:34] odie5533: is it a library? or a tool?
[06:34] both!
[06:34] you can basically just write a Scrapy script to mirror a site, and WarcMiddleware handles the Warc-saving part of it
[06:35] Or you can use the included script, which has a CLI
[06:35] so you can either write your own scripts, or use command line parameters. Writing your own is more powerful, but the command line params work for most stuff
[06:35] mm.
[06:36] odie5533: if you throw horribly broken HTML at Scrapy, what does it do?
[06:36] :P
[06:36] and/or what does it not do
[06:36] It uses lxml, which is a very powerful parsing library
[06:36] And if it doesn't parse right, you can just modify the parser relatively easily
[06:36] afaik lxml stumbles pretty easily
[06:36] over non-well-formed documents
[06:36] here's the parser: https://github.com/odie5533/WarcMiddleware/blob/master/crawltest/spiders/simplespider.py
[06:37] unless it speaks beautifulsoup nowadays
[06:37] it does
[06:37] and has for years
[06:37] * joepie91 raises eyebrow
[06:37] but the default parser seems to work quite well. And regardless, you can use any parser you want by just modifying the simplespider.py
[06:37] right
[06:38] what is the resource usage of Scrapy like?
[06:38] I don't really know
[06:38] It uses Twisted and Python, so I assume not bad.
[06:38] ... twisted
[06:38] :|
[06:38] eh?
[06:38] I love Twisted.
[06:38] * joepie91 has very bad experiences with Twisted on OpenVZ
[06:38] leaks like a motherfucker
[06:38] also, I despite Twisted
[06:39] can you use Scrapy without ever touching Twisted?
[06:39] so, one problem that we have right now is that sites that have a few hundred thousand URLs consume large amounts of memory in wget
[06:39] joepie91: most people do.
[06:39] s/despite/despise/
[06:39] I'd like to know the memory usage characteristics of scrapy on such a target
[06:39] seconding what yipdw said
[06:39] joepie91: internally, my WarcMiddleware hooks into Twisted. But you don't need to know about that. Just writing Scrapy scripts involves basically no Twisted
[06:39] by "large", I mean around 200 MB
[06:39] odie5533: right, that would fall under 'acceptable' then
[06:39] yipdw: I don't think it solves that.
[06:39] but I am working on a solution right now.
[06:39] so it's not THAT large, but it can be a problem
[06:39] though I'm still not happy about the size of Twisted as a dependency
[06:39] it's a problem for ArchiveBot, at least
[06:40] yipdw: the URL strings are too big?
[06:40] joepie91: it's like 2 MB?
[06:40] odie5533: no, it's just a lot of URLs
[06:40] yipdw: I don't understand.
[06:40] odie5533: wget keeps track of what it's seen during a recursive retrieval; the amount of storage required is directly proportional to the number of URLs it's seen
[06:40] Are the URL strings too big to hold in memory, or does it just trip up if you give it too many URL strings?
[06:41] er, it has
[06:41] odie5533: if I am not mistaken, it pulls in quite a lot of dependencies
[06:41] joepie91: You are mistaken
[06:41] it uses Zope Interface and OpenSSL. that's it
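(To illustrate the "just write a Scrapy script" workflow described above: a minimal mirroring spider looks something like this. Old-style Scrapy API, since WarcMiddleware pins old Scrapy/Twisted versions; the domain is a placeholder and the WarcMiddleware settings hookup is omitted.)

    import urlparse

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class MirrorSpider(BaseSpider):
        name = "mirror"
        allowed_domains = ["example.com"]     # made-up target
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # WarcMiddleware, when enabled in settings.py, records each
            # request/response pair to a WARC as a side effect; the
            # spider itself only has to discover and follow links
            hxs = HtmlXPathSelector(response)
            for href in hxs.select("//a/@href").extract():
                yield Request(urlparse.urljoin(response.url, href))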
[06:41] odie5533: the amount of memory wget uses for each new URL is small, but it does add up; and once you get closer to a million URLs, the memory required becomes problematic for some applications
[06:42] I'm mentioning this only because this is an ArchiveBot issue
[06:42] it is emphatically NOT a problem for most warrior applications
[06:42] yipdw: I assume Scrapy could handle it fine, though I'm not sure my WarcMiddleware is up to the task, but it might be
[06:42] ok
[06:42] I'll take a look at it
[06:43] yipdw: the --url-file option inputs a list of urls
[06:44] If you are going to try WarcMiddleware, please read the INSTALL.md file. The installation is slightly complicated because you need to use old versions of Twisted and Scrapy.
[06:44] odie5533: wget has something similar, but read-from-file mode is a bit different
[06:44] at least operationally; they could very well use the same code paths
[06:46] yipdw: for archivebot, you could write a WgetManager that only sends it a partial list of urls
[06:46] odie5533: streaming in a list is possible, but complicates handling pipeline failure
[06:46] odie5533: can you update it to work with newest Twisted and Scrapy..?
[06:46] at least in seesaw
[06:46] * joepie91 hates unstable deps
[06:46] joepie91: Afraid not... =/
[06:46] I don't like them either
[06:46] how so?
[06:47] but Twisted made major internal changes to support HTTP 1.1, and those changes make it much more difficult to hook into.
[06:48] urgh
[06:48] so I could either force HTTP 1.0 or, the method I chose, use a proxy.
[06:48] * joepie91 has no plans of moving Twisted out of his "dislike" list any time soon
[06:48] then I don't need to use hacks to hook into Scrapy/Twisted.
[06:48] I wrote a couple proxies actually :D
[06:48] odie5533: proxy in what sense?
[06:48] web proxy
[06:49] wouldn't that fuck the headers?
[06:49] not really
[06:49] it can still record the exact headers sent to and from the server
[06:49] odie5533: on a completely unrelated note, I think a PhantomJS controller is something else that'd be useful, especially given that many sites no longer work by just freeze-drying resources
[06:50] I did write a PhantomJS + WarcProxy pipeline once, but it was unstable as hell
[06:50] my WarcProxy?
[06:50] yes
[06:50] yipdw: a better warc proxy would fix it I think
[06:50] WarcProxy was fine
[06:50] the pipeline control was the problem
[06:50] well, it doesn't do SSL
[06:50] the new one does SSL: https://github.com/odie5533/WarcMITMProxy
[06:50] oh, that was irrelevant for the purpose of this project
[06:51] I suspect WarcMITMProxy could be substituted and the problems would remain
[06:51] well, it uses a completely different library
[06:51] it wasn't WarcProxy that was the issue; it was controlling the pipeline
[06:51] so it might solve the problems. I don't know though
[06:51] ah
[06:51] is phantomjs like ghostjs?
[06:51] the setup was a bit of a mess of poltergeist + capybara + PhantomJS + WarcProxy
[06:51] it worked, but required a lot of babysitting
[06:52] I think something that could plug into Seesaw as an alternative fetcher would be very useful for making the Warrior more useful for Javascript monstrosities
[06:52] well that's sort of why the proxy is a great thing since you just need to write the scraping stuff and let the proxy handle the Warc details
[06:52] odie5533: phantomjs is headless WebKit
[06:52] made Javascript-scriptable
[06:52] oh! this reminds me
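(For what it's worth, gluing PhantomJS to a recording proxy is mostly a matter of its --proxy flag; a rough sketch, where the proxy address and the capture script are hypothetical:)

    import subprocess

    PROXY = "127.0.0.1:8000"  # assumed WarcMITMProxy listen address

    # every request the headless browser makes goes through the
    # WARC-writing proxy, so JS-fetched resources get recorded
    # without the scraping side knowing anything about WARC
    subprocess.check_call([
        "phantomjs",
        "--proxy=%s" % PROXY,
        "--proxy-type=http",
        "render_page.js",         # hypothetical page-walking script
        "http://example.com/",
    ])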
[06:53] I was thinking of Ghost.py, not ghost.js (blog)
[06:53] yipdw: the earlier realize thing that was used for puush and isohunt should probably be a part of seesaw
[06:53] Ghost.py I think is not headless. It's a full client
[06:53] so it can be used for interpolating ranges
[06:53] yipdw: using a headless webkit is going to be really beefy
[06:53] joepie91: I don't have release access on seesaw
[06:54] yipdw: just making a remark :)
[06:54] odie5533: yeah, but beefiness is not really my concern -- a good archive is
[06:54] ah
[06:54] so that it doesn't get lost in my forgetfulness
[06:54] and I'd rather that be done in the simplest way
[06:54] also, yipdw, define "release access"
[06:54] joepie91: I can't push new Seesaw versions
[06:54] did the freeze thing not work?
[06:54] on? github? pypi?
[06:54] only alard can do that, AFAIK
[06:54] because I maintain the pypi package
[06:54] oh
[06:54] well, ok
[06:55] yipdw: https://github.com/iramari/flashfreeze
[06:55] yipdw: alard just bothers me every time it needs to be updated :)
[06:55] (which is fine with me)
[06:55] odie5533: I haven't tried it
[06:55] the main reason I used PhantomJS is that I was familiar with its capabilities
[06:56] and I knew that it would get me a good archive
[06:56] flashfreeze is something else to try
[06:56] I did a comparison of them on my WarcProxy page: https://github.com/odie5533/WarcProxy
[06:56] it works pretty well, though it misses a little bit
[06:57] I'm curious to see how that works on e.g. Discourse forums
[06:57] should work
[06:57] it runs a full Qt WebKit
[06:57] ok
[06:58] I'll give that a shot in ArchiveBot
[06:58] (someday)
[06:58] I think FlashFreeze doesn't output warc though
[06:58] oh
[06:58] hm
[06:59] it outputs URLs and then sends them to wget I think...
[06:59] wget outputs WARC, I guess that's enough
[06:59] so it outputs all the asset urls
[06:59] it's sort of enough. If there are sessions or cookies or something then wget might not be able to grab them
[07:00] ah, right
[07:00] also uses 2x bandwidth
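(That flashfreeze-to-wget handoff would look roughly like this, assuming flashfreeze has written the discovered asset URLs to a file; its actual output format is not verified here.)

    import subprocess

    # feed the discovered URLs to wget and let wget write the WARC;
    # re-fetching everything is where the 2x bandwidth cost comes from
    subprocess.check_call([
        "wget",
        "-i", "asset_urls.txt",   # hypothetical flashfreeze output
        "--warc-file=example",    # produces example.warc.gz
        "-e", "robots=off",
        "--page-requisites",
    ])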