05:41 <odie5533> didn't even know this channel existed
05:41 <JRWR> ya
05:41 <joepie92> now you know :)
05:41 <JRWR> it exists
05:41 <joepie92> but yeah
05:41 <joepie92> as for security
05:41 <joepie92> this is always going to be an issue to some degree
05:41 <yipdw> so, yeah, if you're proposing security improvements, that's great, but first, what's the threat model
05:41 <joepie92> even if you lock down the API it can still be abused
05:41 <ryonaloli> well the current solution of a big clunky vm isn't a good idea in the first place
05:41 <yipdw> it works
05:42 <yipdw> what are the shortcomings?
05:42 <odie5533> ryonaloli: why is that not a good idea?
05:42 <odie5533> How big is the warrior? I've never used it
05:42 <JRWR> I'm thinking that would be awesome, a Lua sandbox with a nice slim API, cross-send the data to more than one worker to make sure a worker is behaving, and auto-ban or throttle
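JRWR's cross-checking idea can be pinned down a little. A very loose sketch of what it might look like, purely hypothetical: the digest/cross_check helpers and the tracker behaviour are invented for illustration, and byte-exact comparison would need care on dynamic pages.

```python
# Hypothetical sketch of the cross-checking idea: hand the same item to two
# workers and compare digests of what they send back; disagreement marks a
# worker as suspect for throttling or banning. Nothing like this exists in
# seesaw today - names and structure are made up for illustration.
import hashlib

def digest(payload: bytes) -> str:
    """Digest used to compare two workers' results for the same item."""
    return hashlib.sha256(payload).hexdigest()

def cross_check(item_id: str, result_a: bytes, result_b: bytes) -> bool:
    """Return True if both workers returned identical content for the item."""
    match = digest(result_a) == digest(result_b)
    if not match:
        # A real tracker might throttle or auto-ban the disagreeing worker here.
        print(f"item {item_id}: worker results disagree, flagging for review")
    return match

if __name__ == "__main__":
    cross_check("item-0001", b"<html>same</html>", b"<html>same</html>")
    cross_check("item-0002", b"<html>good</html>", b"<html>tampered</html>")
```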
05:42 <yipdw> I mean, you call it not a good idea, but then there's empirical data showing that it gets the job done
05:42 <ryonaloli> it's big, clunky, slow, and not even that secure
05:42 <odie5533> well, then just release a lightweight version running a minimal Debian install.
05:43 <joepie92> (I also feel compelled to note that it is unlikely that any changes will be made, by anyone other than the person proposing it, as long as the current system works "well enough" - there are more urgent things to direct time/attention to for many here)
05:43 <ryonaloli> also, it gets in people's way
05:43 <ryonaloli> a big clunky window isn't nearly as nice as a tiny little service running in the background with an icon
05:43 <yipdw> so
05:43 <yipdw> ok
05:43 <joepie92> ryonaloli: that's just an implementation detail
05:43 <Cameron_D> You can launch a VirtualBox VM silently
05:43 <yipdw> I want to point out that one reason for the Warrior, and Seesaw, was actually not security
05:43 <joepie92> virtualbox has an API
05:43 <yipdw> UNIFORM REPRESENTATION OF ARCHIVES is the big one
05:43 <JRWR> our install base would explode if we had a simple Windows installer
05:43 <odie5533> portability
05:43 <ryonaloli> what's the reason then?
05:43 <joepie92> <odie5533> portability
05:44 <joepie92> that
05:44 <JRWR> I don't want the warrior taking 3GB of RAM :)
05:44 <odie5533> JRWR: you could package the VM with a VirtualBox installer.
05:44 <yipdw> with Geocities and (to some extent) Google Video
05:44 <ryonaloli> portability is only for usability, and as JRWR says, it doesn't make much difference if no one installs anyway
05:44 <joepie92> ryonaloli: usability is exactly the point
05:44 <yipdw> there were problems with Windows machines and oddly-configured Unix-like things returning crap data
05:44 <yipdw> the Warrior VM is intended to fix that, and to a large degree it succeeded in that task
05:45 <yipdw> the reason why is not because it is ultra-secure
05:45 <joepie92> it's just a simple works-almost-everywhere solution
05:45 <yipdw> it's because it was an easy way to get started for a lot more people
05:45 <yipdw> and a lot of people do not tinker with the VM
05:45 <joepie92> without much maintenance required
05:45 <ryonaloli> even if security is no issue, it's a terrible workaround
05:46 <yipdw> now, if we start getting people who decide to fuck our projects, then we have a new threat
05:46 <yipdw> the aesthetics or security properties of the Warrior aside, I want people to understand *why* it exists
05:46 <yipdw> you cannot propose alternatives without understanding its historical context
05:46 <JRWR> oh and the current workers on the warrior don't work
05:46 <JRWR> only the short URL one does at this moment
05:46 <yipdw> they work fine, they're just not tasks
05:47 <yipdw> tasked
05:47 <ryonaloli> if this becomes bigger, we could always make it redundant, right?
05:47 <yipdw> I also want to caution against roflscale dreams
05:47 <yipdw> most Archive Team projects do not require a thousand-soldier army
05:47 <joepie92> yipdw: roflscale dreams?
05:47 <yipdw> most will never have them
05:47 <yipdw> joepie92: yeah
05:47 <JRWR> the "4chan" army
05:49 <JRWR> get 100k people running warriors
05:49 <yipdw> so, there are claims of the warrior being big, clunky, slow, and not secure
05:49 <ryonaloli> there's always ways to get huge amounts of clients, even if only through advertising
05:49 <yipdw> I'd like to know of specifics
05:49 <ryonaloli> i mean, /g/ would *love* this kind of thing
05:49 <yipdw> and the best bang-for-time changes
05:50 <ryonaloli> specifics? it's the overhead of running an entire OS
05:50 <joepie92> yipdw: one specific I have pointed out before (that is being sidestepped nowadays by running manually) is that running an entire virtualbox VM on a low-end VPS is absolutely not feasible
05:50 <joepie92> but as I already said, that is not hard to solve
05:50 <joepie92> just adapt warrior updating code to run outside the warrior
05:50 <yipdw> joepie92: yeah, but so far the workaround for that is "run the seesaw scripts directly"
05:50 <joepie92> yes, I know
05:50 <yipdw> and that seems to be doing well for the current audience
05:50 <JRWR> that sounds messy
05:50 <yipdw> it really isn't
05:50 <joepie92> yipdw: I really would prefer automated updating; now I need to SSH into xx boxes every time a new project appears
05:51 <joepie92> I'd like that to be automatic
05:51 <joepie92> but it's *doable* now
05:51 <yipdw> you could extract that from the Warrior, run that as a Seesaw task
05:51 <joepie92> yipdw: I was around when we just started doing this manual running thing of pipelines; it has much improved since then :)
05:51 <ryonaloli> how complicated can it be to make a website grabber that gets instructions from a c&c... i mean, you could just rewrite it all in scratch, no?
05:51 <ryonaloli> *from scratch
05:51 <joepie92> ryonaloli: there is absolutely no need to rewrite anything from scratch
05:52 <joepie92> there is already a solid framework
05:52 <yipdw> ryonaloli: it's not that hard, it's just that the payoff so far isn't there
05:52 <joepie92> also, JRWR, see http://github.com/joepie91/isohunt-grab
05:52 <joepie92> that is the entirety of manual setup instructions
05:52 <joepie92> virtually every seesaw project will have identical instructions
05:52 <joepie92> most of it is set up once, run many times
05:52 <ryonaloli> having separate scripts for every project is just nauseating imo...
05:52 <Cameron_D> Why, every website is different
05:52 <ryonaloli> i mean, sure it technically works, but there's the KISS principle to remember
05:52 <yipdw> please look at the scripts
05:52 <joepie92> ryonaloli: that is going to be the case anyway
05:52 <Cameron_D> Need to handle them differently
05:53 <joepie92> ryonaloli: you have to understand that these aren't separate -applications-
05:53 <joepie92> they just define custom behaviour within the seesaw framework
05:53 <yipdw> ryonaloli: this is KISS
05:53 <joepie92> the warrior uses these exact same scripts - it just auto-downloads them
05:53 <joepie92> that's it really
05:53 <yipdw> a highly configurable framework and a strict task schema is falling on the other side of KISS
05:54 <joepie92> why am I a 92?
05:54 <yipdw> I mean, there's a very strong demonstration that this is KISS
05:54 <yipdw> a grabber can be written, tested, and deployed in a couple of days, or less
05:54 <joepie91> ryonaloli: read through this: https://github.com/ArchiveTeam/blip.tv-grab-video-only/blob/master/pipeline.py
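For a sense of what the linked pipeline.py files contain, here is a heavily condensed sketch of the shape of a seesaw project script. Class and module names are recalled from seesaw-kit and may not match a given release exactly; the real scripts sandwich a wget download step and tracker upload/reporting tasks between the setup and cleanup shown here.

```python
# Condensed sketch of a seesaw pipeline.py; illustrative only, API names from
# memory of seesaw-kit rather than checked against a specific version.
from seesaw.pipeline import Pipeline
from seesaw.task import SimpleTask

class PrepareDirectories(SimpleTask):
    """Per-item setup: decide where this item's data will be written."""
    def __init__(self):
        SimpleTask.__init__(self, "PrepareDirectories")

    def process(self, item):
        # 'item' acts as a shared dict of values that later tasks interpolate.
        item["item_dir"] = "data/%s" % item["item_name"]

class CleanUp(SimpleTask):
    """Per-item teardown once the item has been uploaded."""
    def __init__(self):
        SimpleTask.__init__(self, "CleanUp")

    def process(self, item):
        print("finished", item["item_name"])

# The warrior auto-downloads this same file and runs the same pipeline.
pipeline = Pipeline(
    PrepareDirectories(),
    CleanUp(),
)
```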
05:54 <joepie91> yipdw: uh, one day
05:54 <joepie91> for isohunt-grab
05:54 <yipdw> ok
05:54 <ryonaloli> can't the tracker send the scripts instead of downloading manually?
05:54 <ryonaloli> i mean, that'd make sense
05:55 <joepie91> and isohunt-grab is fairly complicated
05:55 <joepie91> ryonaloli: the tracker already tells the warriors where to find the scripts
05:55 <joepie91> this already exists
05:55 <joepie91> it just doesn't run outside the warrior VM yet
05:55 <ryonaloli> yeah i know, but it's on a vm
05:55 <joepie91> so make it not be in a VM!
05:55 <ryonaloli> i can't into windows ;_;
05:55 <joepie91> the infrastructure is already there
05:55 <joepie91> exactly
05:55 <joepie91> so our problem isn't code delivery
05:56 <joepie91> our problem is platform compatibility
05:56 <Cameron_D> which is what the VM solves
05:56 <joepie91> and the compatibility issues aren't (exclusively) caused by the code
05:56 <joepie91> but primarily by inherent platform restrictions
05:56 <joepie91> filesystems etc.
05:56 <ryonaloli> can't it be stored in a database of some kind?
05:57 <joepie91> Cameron_D: for certain values of "solves", yes
05:57 <joepie91> ryonaloli: can't what be stored in a database?
05:58 <ryonaloli> well, you said there's issues with filesystem compatibility, and in the other channel i heard something about windows not storing filenames correctly in NTFS
05:58 <joepie91> ryonaloli: databases aren't magic; in a situation like this, if anything, it would be a bottleneck
05:58 <joepie91> (depending on your definition of 'database' - you could technically consider a filesystem to be a database)
05:58 <yipdw> ryonaloli: I guess you could use sqlite or something, and distribute it; but what would be the benefit over telling people "get a VM or run it on these platforms that we know it works on"
05:59 <Cameron_D> Well, to an extent the WARC+CDX is a database of sorts, but the file still needs to be downloaded and stored first
05:59 <yipdw> we are not going for infinite scalability, and we so far do not need it
05:59 <yipdw> no site has been able to actually exhaust Warrior capacity
05:59 <ryonaloli> yipdw: because it's overly complicated, and keeps the average altruistic user from doing it
05:59 <ryonaloli> that's one reason
05:59 <odie5533> A lightweight alternative client might be useful.
05:59 <odie5533> Especially for e.g. deploying on a remote server
05:59 <joepie91> also, ryonaloli, I think that first of all you should become a bit more familiar with the internals of seesaw and the warrior
05:59 <joepie91> because I feel like you misunderstand how it works
05:59 <joepie91> on a technical level
05:59 <yipdw> ryonaloli: "overly complicated" is a bit of a difficult metric, but ok
06:00 <Cameron_D> odie5533: that already exists in the form of the seesaw script
06:00 <odie5533> Cameron_D: then I guess we have everything!
06:00 <Cameron_D> https://github.com/joepie91/isohunt-grab#running-without-a-warrior
06:01 <odie5533> maybe some refactoring, perhaps a lightweight Windows native script, but the major pieces are there.
06:01 <odie5533> and I use the term natively only somewhat loosely: could run in the JVM or something.
06:01 <yipdw> hmm
06:01 <joepie91> again, the problem here is the lack of auto-updating really
06:01 <joepie91> (re: seesaw)
06:02 <joepie91> but that is not a hard problem to solve
06:02 <joepie91> just requires work
06:02 <odie5533> What do you mean auto-updating?
06:02 <yipdw> the warrior updates its scripts on a periodic basis
06:02 <yipdw> that's a shell script
06:02 <joepie91> odie5533: automatically retrieving new versions of the pipeline code used to grab websites, and retrieving code for new projects
06:02 <joepie91> like the warrior does
06:03 <yipdw> you could just take that and run it as a separate process in a seesaw project
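The shell-script updater yipdw mentions is specific to the VM, but the idea translates directly to a standalone process. A rough sketch, with the project path and run-pipeline invocation assumed for illustration rather than taken from any existing warrior code:

```python
# Not the warrior's actual updater; a minimal sketch of the same idea for
# running a seesaw project outside the VM: periodically pull the project repo
# and restart the pipeline when it changes. Paths, interval and the
# run-pipeline arguments (YOURNICK is a placeholder) are assumptions.
import subprocess
import time

PROJECT_DIR = "isohunt-grab"   # assumed local checkout of the project scripts
PIPELINE_CMD = ["run-pipeline", "pipeline.py", "--concurrent", "2", "YOURNICK"]

def repo_revision() -> str:
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=PROJECT_DIR).decode().strip()

def main() -> None:
    proc = subprocess.Popen(PIPELINE_CMD, cwd=PROJECT_DIR)
    current = repo_revision()
    while True:
        time.sleep(600)                                  # check every 10 minutes
        subprocess.call(["git", "pull", "--ff-only"], cwd=PROJECT_DIR)
        if repo_revision() != current:
            proc.terminate()                             # stop the old pipeline...
            proc.wait()
            proc = subprocess.Popen(PIPELINE_CMD, cwd=PROJECT_DIR)
            current = repo_revision()                    # ...and start the new one

if __name__ == "__main__":
    main()
```

A real version would want a graceful stop rather than terminate(), so an in-progress item isn't aborted halfway through an upload.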
06:03 <joepie91> I kind of need to continue doing the dishes
06:29 <joepie91> if there's a question specific to me, prefix with my name so I won't miss it when I come back
06:29 <joepie91> for reference, what we need most in my opinion for warrior/seesaw, in this order: complete seesaw documentation, a non-VM Warrior-like wrapper for Linux systems, a spot checking mechanism against potential bad clients, and a pure Python replacement for wget
06:32 <joepie91> (going back to dishes, just figured I'd write that down here before I forgot)
06:32 <joepie91> actually fuck dishes, I have code to write
06:33 <joepie91> and a server to upgrade/migrate
06:33 <joepie91> and... /me eye-scrolls down todo list
06:33 <odie5533> joepie91: I wrote, somewhat, a python replacement for wget
06:34 <odie5533> https://github.com/odie5533/WarcMiddleware
06:34 <joepie91> odie5533: is it a library? or a tool?
06:34 <odie5533> both!
06:35 <odie5533> you can basically just write a Scrapy script to mirror a site, and WarcMiddleware handles the Warc-saving part of it
06:35 <odie5533> Or you can use the included script, which has a CLI
06:35 <odie5533> so you can either write your own scripts, or use command line parameters. Writing your own is more powerful, but the command line params work for most stuff
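To make the division of labour concrete: the crawl logic is an ordinary Scrapy spider, and WarcMiddleware (hooked in through the Scrapy project settings, per its install instructions) records the traffic as WARC. The spider below is a generic sketch; the example.com URLs are placeholders, and selector/request APIs differ between the older Scrapy releases WarcMiddleware targets and current ones.

```python
# Minimal Scrapy spider for mirroring a site; WARC writing is assumed to be
# handled by a downloader middleware such as WarcMiddleware, configured in the
# project's settings rather than in the spider itself.
import scrapy
from urllib.parse import urljoin

class MirrorSpider(scrapy.Spider):
    name = "mirror"
    start_urls = ["http://example.com/"]   # placeholder start page

    def parse(self, response):
        # Follow every same-site link; each request/response pair passes
        # through the downloader middlewares, which is where WARC capture
        # would happen.
        for href in response.xpath("//a/@href").extract():
            url = urljoin(response.url, href)
            if url.startswith("http://example.com/"):
                yield scrapy.Request(url, callback=self.parse)
```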
06:35
🔗
|
joepie91 |
mm. |
06:36
🔗
|
joepie91 |
odie5533: if you throw horribly broken HTML at Scrapy, what does it do? |
06:36
🔗
|
joepie91 |
:P |
06:36
🔗
|
joepie91 |
and/or what does it not do |
06:36
🔗
|
odie5533 |
It uses lxml, which is a very powerful parsing library |
06:36
🔗
|
odie5533 |
And if it doesn't parse right, you can just modify the parser relatively easily |
06:36
🔗
|
joepie91 |
afaik lxml stumbles pretty easily |
06:36
🔗
|
joepie91 |
over non-well-formatted documents |
06:36
🔗
|
odie5533 |
here's the parser: https://github.com/odie5533/WarcMiddleware/blob/master/crawltest/spiders/simplespider.py |
06:37
🔗
|
joepie91 |
unless it speaks beautifulsoup nowadays |
06:37
🔗
|
odie5533 |
it does |
06:37
🔗
|
odie5533 |
and has for years |
06:37
🔗
|
* |
joepie91 raises eyebrow |
06:37
🔗
|
odie5533 |
but the default parser seems to work quite well. And regardless, you can use any parser you want by just modifying the simplespider.py |
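The lxml question is easy to check directly: its HTML parser (which Scrapy's selectors are built on) error-corrects broken markup rather than rejecting it, much as BeautifulSoup does. A quick illustration:

```python
# lxml.html recovers from unclosed tags and still exposes text and links.
import lxml.html

broken = "<html><body><p>unclosed paragraph <b>unclosed bold <a href='/x'>link"
doc = lxml.html.fromstring(broken)

print(doc.text_content())                      # unclosed paragraph unclosed bold link
print([a.get("href") for a in doc.iter("a")])  # ['/x']
```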
06:37
🔗
|
joepie91 |
right |
06:38
🔗
|
joepie91 |
what is resource usage of scrapy like? |
06:38
🔗
|
odie5533 |
I don't really know |
06:38
🔗
|
odie5533 |
It uses Twisted and Python, so I assume not bad. |
06:38
🔗
|
joepie91 |
... twisted |
06:38
🔗
|
joepie91 |
:| |
06:38
🔗
|
odie5533 |
eh? |
06:38
🔗
|
odie5533 |
I love Twisted. |
06:38
🔗
|
* |
joepie91 has very bad experiences with Twisted on OpenVZ |
06:38
🔗
|
joepie91 |
leaks like a motherfucker |
06:38
🔗
|
joepie91 |
also, I despite Twisted |
06:39
🔗
|
joepie91 |
can you use Scrapy without ever touching Twisted? |
06:39
🔗
|
yipdw |
so, one problem that we have right now is that sites that have a few hundred thousand URLs consume large amounts of memory in wget |
06:39
🔗
|
odie5533 |
joepie91: most people do. |
06:39
🔗
|
joepie91 |
s/despite/despise/ |
06:39
🔗
|
yipdw |
I'd like to know the memory usage characteristics of scrapy on such a target |
06:39
🔗
|
joepie91 |
seconding what yipdw said |
06:39
🔗
|
odie5533 |
joepie91: internally, my WarcMiddleware hooks into Twisted. But you don't need to know about that. Just writing Scrapy scripts is basically not any Twisted |
06:39
🔗
|
yipdw |
by "large", I mean around 200 MB |
06:39
🔗
|
joepie91 |
odie5533: right, that would fall under 'acceptable' then |
06:39
🔗
|
odie5533 |
yipdw: I don't think it solves that. |
06:39
🔗
|
odie5533 |
but I am working on a solution right now. |
06:39
🔗
|
yipdw |
so it's not THAT large, but it can be a problem |
06:39
🔗
|
joepie91 |
though I'm still not happy about the size of Twisted as a dependency |
06:39
🔗
|
yipdw |
it's a problem for ArchiveBot, at least |
06:40
🔗
|
odie5533 |
yipdw: the URL strings are too big? |
06:40
🔗
|
odie5533 |
joepie91: it's like 2 MB? |
06:40
🔗
|
yipdw |
odie5533: no, it's just a lot of URLs |
06:40
🔗
|
odie5533 |
yipdw: I don't understand. |
06:40
🔗
|
yipdw |
odie5533: wget keeps track of what it's seen during a recursive retrieval; the amount of storage required is directly proportional to the number of URLs it's seen |
06:40
🔗
|
odie5533 |
Are the URL strings too big to hold in memory, or does it just trip up if you give it too many URL strings? |
06:41
🔗
|
yipdw |
er, it has |
06:41
🔗
|
joepie91 |
odie5533: if I am not mistaken, it pulls in quite a lot of dependencies |
06:41
🔗
|
odie5533 |
joepie91: You are mistaken |
06:41
🔗
|
odie5533 |
it uses Zope Interface and OpenSSL. that's it |
06:41
🔗
|
yipdw |
odie5533: the amount of memory wget uses for each new URL is small, but it does add up; and once you get closer to a million URLs, the memory required becomes problematic for some applications |
06:42
🔗
|
yipdw |
I'm mentioning this only because this is an ArchiveBot issue |
06:42
🔗
|
yipdw |
it is emphatically NOT a problem for most warrior applications |
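The wget figure is easy to sanity-check with back-of-the-envelope numbers. The per-entry cost below is an assumption (URL string plus hash-table and bookkeeping overhead), not a measurement of wget's internals, but it lands in the same range as the 200 MB figure quoted above for a few hundred thousand URLs:

```python
# Rough estimate of the visited-URL set's memory footprint during a recursive
# wget crawl. Both constants are assumptions for illustration.
AVG_URL_BYTES = 100        # assumed average URL length
OVERHEAD_BYTES = 300       # assumed per-entry struct/hash-table overhead

for urls in (100_000, 500_000, 1_000_000):
    mb = urls * (AVG_URL_BYTES + OVERHEAD_BYTES) / 1_000_000
    print(f"{urls:>9} URLs  ~{mb:.0f} MB")
# ~40 MB at 100k URLs, ~200 MB at 500k, ~400 MB approaching a million -
# consistent with "around 200 MB" for a site with a few hundred thousand URLs.
```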
06:42 <odie5533> yipdw: I assume Scrapy could handle it fine, though I'm not sure my WarcMiddleware is up to the task, but it might be
06:42 <yipdw> ok
06:43 <yipdw> I'll take a look at it
06:44 <odie5533> yipdw: the --url-file command inputs a list of urls
06:44 <odie5533> If you are going to try WarcMiddleware, please read the INSTALL.md file. The installation is slightly complicated because you need to use old versions of Twisted and Scrapy.
06:44 <yipdw> odie5533: wget has something similar, but read-from-file mode is a bit different
06:46 <yipdw> at least operationally; they could very well use the same code paths
06:46 <odie5533> yipdw: for archivebot, you could write a WgetManager that only sends it a partial list of urls
06:46 <yipdw> odie5533: streaming in a list is possible, but complicates handling pipeline failure
06:46 <joepie91> odie5533: can you update it to work with newest Twisted and Scrapy..?
06:46 <yipdw> at least in seesaw
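The "WgetManager" odie5533 suggests could be sketched as below; the class name comes from the conversation, but the implementation is an assumption rather than anything in ArchiveBot. The idea is to keep the full URL list outside wget and feed it in chunks, so wget's in-memory visited set stays small:

```python
# Feed a large URL list to wget in chunks, one wget invocation and one WARC
# file per chunk. Illustrative sketch only.
import subprocess
import tempfile

class WgetManager:
    def __init__(self, urls, warc_prefix, chunk_size=10_000):
        self.urls = list(urls)
        self.warc_prefix = warc_prefix
        self.chunk_size = chunk_size

    def run(self):
        for n, start in enumerate(range(0, len(self.urls), self.chunk_size)):
            chunk = self.urls[start:start + self.chunk_size]
            with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
                f.write("\n".join(chunk) + "\n")
                f.flush()
                # Each chunk gets its own wget run and its own WARC file.
                subprocess.check_call([
                    "wget", "--input-file", f.name,
                    "--warc-file", f"{self.warc_prefix}-{n:05d}",
                    "--no-verbose", "--delete-after",
                ])

# Example: WgetManager(open("urls.txt").read().split(), "site-grab").run()
```

As yipdw notes, this trades memory for failure handling: a crashed run leaves the pipeline to work out which chunks actually completed.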
06:46 * joepie91 hates unstable deps
06:46 <odie5533> joepie91: Afraid not... =/
06:46 <odie5533> I don't like them either
06:47 <joepie91> how so?
06:48 <odie5533> but Twisted made major internal changes to support HTTP 1.1, and those changes make it much more difficult to hook into.
06:48 <joepie91> urgh
06:48 <odie5533> so I could either force HTTP 1.0 or, the method I chose, use a proxy.
06:48 * joepie91 has no plans of moving Twisted out of his "dislike" list any time soon
06:48 <odie5533> then I don't need to use hacks to hook into Scrapy/Twisted.
06:48 <odie5533> I wrote a couple proxies actually :D
06:48 <joepie91> odie5533: proxy in what sense?
06:49 <odie5533> web proxy
06:49 <joepie91> wouldn't that fuck the headers?
06:49 <odie5533> not really
06:49 <odie5533> it can still record the exact headers sent to and from the server
06:50 <yipdw> odie5533: on a completely unrelated note, I think a PhantomJS controller is something else that'd be useful, especially given that many sites no longer work by just freeze-drying resources
06:50 <yipdw> I did write a PhantomJS + WarcProxy pipeline once, but it was unstable as hell
06:50 <odie5533> my WarcProxy?
06:50 <yipdw> yes
06:50 <odie5533> yipdw: a better warc proxy would fix it I think
06:50 <yipdw> WarcProxy was fine
06:50 <yipdw> the pipeline control was the problem
06:50 <odie5533> well, it doesn't do SSL
06:50 <odie5533> the new one does SSL: https://github.com/odie5533/WarcMITMProxy
06:51 <yipdw> oh, that was irrelevant for the purpose of this project
06:51 <yipdw> I suspect WarcMITMProxy could be substituted and the problems would remain
06:51 <odie5533> well, it uses a completely different library
06:51 <yipdw> it wasn't WarcProxy that was the issue; it was controlling the pipeline
06:51 <odie5533> so it might solve the problems. I don't know though
06:51 <odie5533> ah
06:51 <odie5533> is phantomjs like ghostjs?
06:51 <yipdw> the setup was a bit of a mess of poltergeist + capybara + PhantomJS + WarcProxy
06:52 <yipdw> it worked, but required a lot of babysitting
06:52 <yipdw> I think something that could plug into Seesaw as an alternative fetcher would be very useful for making the Warrior more useful for Javascript monstrosities
06:52 <odie5533> well that's sort of why the proxy is a great thing since you just need to write the scraping stuff and let the proxy handle the Warc details
06:52 <yipdw> odie5533: phantomjs is headless WebKit
06:52 <yipdw> made Javascript-scriptable
06:53 <joepie91> oh! this reminds me
06:53 <odie5533> I was thinking of Ghost.py, not ghost.js (blog)
06:53 <joepie91> yipdw: the earlier realize thing that was used for puush and isohunt should probably be a part of seesaw
06:53 <odie5533> Ghost.py I think is not headless. It's a full client
06:53 <joepie91> so it can be used for interpolating ranges
06:53 <odie5533> yipdw: using a headless webkit is going to be really beefy
06:54 <yipdw> joepie91: I don't have release access on seesaw
06:54 <joepie91> yipdw: just making a remark :)
06:54 <yipdw> odie5533: yeah, but beefiness is not really my concern -- a good archive is
06:54 <odie5533> ah
06:54 <joepie91> so that it doesn't get lost in my forgetfulness
06:54 <yipdw> and I'd rather that be done in the simplest way
06:54 <joepie91> also, yipdw, define "release access"
06:54 <yipdw> joepie91: I can't push new Seesaw versions
06:54 <odie5533> did the freeze thing not work?
06:54 <joepie91> on? github? pypi?
06:54 <yipdw> only alard can do that, AFAIK
06:54 <joepie91> because I maintain the pypi package
06:54 <yipdw> oh
06:55 <yipdw> well, ok
06:55 <odie5533> yipdw: https://github.com/iramari/flashfreeze
06:55 <joepie91> yipdw: alard just bothers me every time it needs to be updated :)
06:55 <joepie91> (which is fine with me)
06:55 <yipdw> odie5533: I haven't tried it
06:56 <yipdw> the main reason I used PhantomJS is that I was familiar with its capabilities
06:56 <yipdw> and I knew that it would get me a good archive
06:56 <yipdw> flashfreeze is something else to try
06:56 <odie5533> I did a comparison of them on my WarcProxy page: https://github.com/odie5533/WarcProxy
06:57 <odie5533> it works pretty well, though it misses a little bit
06:57 <yipdw> I'm curious to see how that works on e.g. Discourse forums
06:57 <odie5533> should work
06:57 <odie5533> it runs a full Qt WebKit
06:58 <yipdw> ok
06:58 <yipdw> I'll give that a shot in ArchiveBot
06:58 <yipdw> (someday)
06:58 <odie5533> I think FlashFreeze doesn't output WARC though
06:58 <yipdw> oh
06:59 <yipdw> hm
06:59 <odie5533> it outputs URLs and then sends them to wget I think...
06:59 <yipdw> wget outputs WARC, I guess that's enough
06:59 <odie5533> so it outputs all the asset urls
07:00 <odie5533> it's sort of enough. If there are sessions or cookies or something then wget might not be able to grab them
07:00 <yipdw> ah, right
07:00 <odie5533> also uses 2x bandwidth