#warrior 2013-02-24,Sun


Time	Nickname	Message
03:52 ^🔗	omf_	Is this really the most up-to-date code for the warrior? https://github.com/ArchiveTeam/warrior-code
09:14 ^🔗	ersi	omf_: Yes, do check if there's a development branch though
09:59 ^🔗	soultcer	omf_, It's warrior-code2, not warrior-code
10:05 ^🔗	omf_	got it
10:05 ^🔗	omf_	I have built some deployable linux vms before. I am also interested in looking at how people do it.
10:06 ^🔗	soultcer	Well the actual building of the vm happens in another git repo called warrior-preseed
10:06 ^🔗	soultcer	It's basically just a customized debian install using the debian installer's preseeding options
10:06 ^🔗	soultcer	I'm currently playing around with it because I want to slim it down a bit
10:10 ^🔗	omf_	yeah I am familiar with the preseed features. I used it to build custom ubuntu installs for a non-profit
10:10 ^🔗	omf_	It really sped up their process
10:11 ^🔗	omf_	One trick I use is to scrub stuff out on the end of build hook
10:13 ^🔗	soultcer	There is a script in warrior-code2 that removes all unneeded files
10:13 ^🔗	soultcer	And some aptitude purge magic that deletes unwanted packages in the preseed file itself
10:13 ^🔗	omf_	Have you done any benchmarking
10:13 ^🔗	omf_	I know for size you have
10:14 ^🔗	omf_	what about RAM usage and boot speed
10:14 ^🔗	soultcer	I don't think anyone has done such benchmarks, but you would have to ask alard to be sure. He is the genius that came up with the warrior
10:15 ^🔗	omf_	kiwi and cobbler are pretty advanced in those kinds of things
10:16 ^🔗	soultcer	I never tried these, but my next step after cleaning up the package lists is to write a plugin for this: https://github.com/andsens/ec2debian-build-ami
10:16 ^🔗	soultcer	Then we could have a warrior AMI that anyone can just run
10:17 ^🔗	omf_	kiwi is the opensuse distro builder, cobbler is the redhat distro builder
10:17 ^🔗	omf_	they both build to multiple formats by default including EC2 instances
10:18 ^🔗	omf_	the opensuse ones can be built and deployed from just the web
10:18 ^🔗	soultcer	Well we are a debian shop, so we have to use what debian gives us :D
10:20 ^🔗	omf_	You should use what is best and everything is open source. Hence benchmarks
10:21 ^🔗	omf_	Like deploying a multithreaded scraper for example has different bottlenecks
10:23 ^🔗	soultcer	I think the problem is that we don't have the manpower to do all that
10:23 ^🔗	omf_	most of it can be automated away
10:23 ^🔗	omf_	so not too much people power
10:24 ^🔗	omf_	We just need a slow stretch
10:24 ^🔗	omf_	not fucking everything is dying at once bullshit
10:26 ^🔗	soultcer	You forget that Archiveteam is a hobby project. Even if there is no project dying right now, there is not always time for new features
10:27 ^🔗	omf_	I am talking about just my personal time there
10:27 ^🔗	soultcer	Oh, well then code away ;-)
10:27 ^🔗	soultcer	All the code we have is on github
10:28 ^🔗	omf_	I have been looking through it
10:28 ^🔗	soultcer	One thing which I think is important though is that more than one member needs to understand each project.
10:29 ^🔗	omf_	yes I agree
10:29 ^🔗	omf_	the hit by a bus problem
10:29 ^🔗	omf_	all the code is shared but is all the knowledge and process documented
10:30 ^🔗	soultcer	The warrior documentation is pretty good and we have many people who are Debian users who can help
10:31 ^🔗	soultcer	On the other hand we also have projects where the code is pretty much a mess and only one person knows how everything works
10:31 ^🔗	omf_	do we have good references for warc and cdx files
10:31 ^🔗	omf_	like I had to read the iso standard
10:31 ^🔗	omf_	and shit like that sucks the life out of you
10:32 ^🔗	soultcer	Well, warc is an ISO standard so that pretty much is the reference
10:33 ^🔗	soultcer	The Internet Archive can only add warc files to the wayback machine, not tar files made with wget, so we are stuck with that
10:33 ^🔗	omf_	it can handle the compressed warcs right?
10:34 ^🔗	soultcer	It's just gzip compression of the whole file I think
10:35 ^🔗	soultcer	But I never had to work with warc files, so in doubt ask alard or underscor ;-)
10:45 ^🔗	omf_	Most of my current work is on url mapping, domain mapping and other things to make sure content coverage is good
10:45 ^🔗	omf_	all this can then be folded into preventative backups of key sites
11:16 ^🔗	alard	omf_: The most important part of the warrior is the ArchiveTeam/seesaw-kit repository. You should install that and use the run-warrior command to start the warrior. ArchiveTeam/warrior-preseed and ArchiveTeam/warrior-code2 are specific to the vm image. You could use these as inspiration for your own. (It's very useful to use the same Debian distribution, though, or you'll have to compile your own Wget+Lua binaries.)
11:17 ^🔗	alard	omf_: Benchmarks really depend on the project. Wget doesn't need a lot of memory, unless it finds a site with a lot of URLs, then no amount of memory is enough.
11:18 ^🔗	alard	The warrior VM uses 400MB of RAM, and that has generally worked so far. (And that's about the only benchmark we have.)
11:19 ^🔗	omf_	That is interesting to know
11:21 ^🔗	alard	omf_: The ISO standard for WARC isn't that bad, I think, and there's also the http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 to go with it.
11:22 ^🔗	alard	It's probably important to add that "compressed warcs" are compressed per warc record, so you can easily extract individual records. It's not just a gzipped file.
11:24 ^🔗	omf_	I am going to read that pdf alard
13:07 ^🔗	soultcer	alard: Is there a reason why you have build-essential installed on the warrior vm?
13:10 ^🔗	alard	Not that I know of. I started with a normal Debian installation and removed the things I thought could be missed.
13:10 ^🔗	alard	Is it very large?
13:11 ^🔗	alard	Perhaps I thought that compiling things was necessary, for pip installs, and this was "essential".
13:11 ^🔗	soultcer	Well it pulls in a gcc so it takes a little bit of space
13:11 ^🔗	soultcer	I am trying to get a list of all "warrior required" packages so that I can create an amazon ec2 image
13:11 ^🔗	alard	gcc is probably useful.
14:42 ^🔗	chazchaz	iirc, one of the dependencies of seesaw has a module that compiles a C or C++ accelerated version if you have the proper dev packages installed
14:43 ^🔗	chazchaz	soultcer
14:43 ^🔗	chazchaz	simplejson
14:46 ^🔗	soultcer	Thanks
14:48 ^🔗	chazchaz	I think it needs build-essential and some python dev package
15:56 ^🔗	db48x22	alard: btw, I pushed the rest of those changes to my pull request. want to merge them or should I just do it?
16:50 ^🔗	soultcer	chazchaz: I just set up the warrior without gcc and you are absolutely right, simplejson complains that it will be installed without speedups
16:50 ^🔗	soultcer	But shouldn't the json module in python 2.6 work just as well/fast?
16:50 ^🔗	db48x22	it works just as well, and I doubt the speed difference is noticeable
16:51 ^🔗	db48x22	all we need to do is parse the occasional request for more work
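
[Editor's sketch] The point above, that the standard-library json module can stand in when simplejson's C speedups are not compiled, is commonly handled with an import fallback. This is only a minimal illustration, not code taken from seesaw-kit; the payload string and its field names are hypothetical placeholders, not the real tracker format.

```python
# Prefer simplejson when it is installed; otherwise fall back to the
# standard-library json module (available since Python 2.6).
try:
    import simplejson as json
except ImportError:
    import json

# Hypothetical work-item payload, for illustration only.
payload = '{"item": "example-item", "project": "example-project"}'

# Both modules expose the same loads() interface for this use.
work = json.loads(payload)
print(work["item"])
```

For small, occasional messages like this, either module should behave the same from the caller's point of view.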
16:56 ^🔗	ersi	soultcer: No, each WARCRecord is gzipped. Not the whole file
16:56 ^🔗	ersi	So it's basically a bunch of gzip streams in one file
16:57 ^🔗	ersi	oh, I missed that alard filled that in already. Sorry.
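
[Editor's sketch] To make "a bunch of gzip streams in one file" concrete, here is a minimal Python illustration of per-record compression. It is not a WARC writer: the record bytes are simplified placeholders rather than valid WARC records, and the helper names write_members and read_member are made up for this example. It assumes you already know the byte offset where a record's gzip member starts, for instance from an index.

```python
import gzip
import io
import zlib


def write_members(records):
    """Compress each record as its own gzip member, back to back.

    Returns the combined bytes and the start offset of each member.
    """
    out = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(out.tell())
        out.write(gzip.compress(rec))
    return out.getvalue(), offsets


def read_member(blob, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; any following
    # members are left untouched in d.unused_data.
    return d.decompress(blob[offset:])


# Placeholder records, not valid WARC.
records = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]
blob, offsets = write_members(records)

# Pull out the second record without decompressing the first.
second = read_member(blob, offsets[1])

# The whole file is still a valid multi-member gzip stream.
everything = gzip.decompress(blob)
```

Because every record is an independent gzip member, a reader can seek to an offset and extract one record, while ordinary gzip tools can still decompress the file as a whole.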
23:19 ^🔗	omf_	Is this the best channel we have to talk about scrapers?
