[03:52] <omf_> It this really the most updated code for the warrior? https://github.com/ArchiveTeam/warrior-code
[09:14] <ersi> omf_: Yes, do check if there's a development branch though
[09:59] <soultcer> omf_, It's warrior-code2, not warrior-code
[10:05] <omf_> got it
[10:05] <omf_> I have built some deployable linux vms before. I am also interested in looking at how people do it.
[10:06] <soultcer> Well the actualy building of the vm happens in another git repo called warrior-preseed
[10:06] <soultcer> It's basically just a customized debian install using the debian installer's preseeding options
[10:06] <soultcer> I'm currently playing around with it because I want to slim it down a bit
[10:10] <omf_> yeah I am familiar with the preseed features. I used it to build custom ubuntu installs for a non-profit
[10:10] <omf_> It really sped up their process
[10:11] <omf_> One trick I use is to scrub stuff out on the end of build hook
[10:13] <soultcer> There is a script in warrior-code2 that removes all unneeded files
[10:13] <soultcer> And some aptitude purge magic that deletes unwanted packages in the preseed file itself
[10:13] <omf_> Have you done any benchmarking
[10:13] <omf_> I know for size you have
[10:14] <omf_> what about RAM usage and boot speed
[10:14] <soultcer> I don't think anyone has done such benchmarks, but you would have to ask alard to be sure. He is the genius that came up with the warrior
[10:15] <omf_> kiwi and cobbler are pretty advanced in those kinds of things
[10:16] <soultcer> I never tried these, but my next step after cleaning up the package lists is to write a plugin for this: https://github.com/andsens/ec2debian-build-ami
[10:16] <soultcer> Then we could have a warrior AMI that anyone can just run
[10:17] <omf_> kiwi is the opensuse distro builder, cobbler is the redhat distro builder
[10:17] <omf_> they both build to multiple formats by default including EC2 instances
[10:18] <omf_> the opensuse ones can be built and deployed from just the web
[10:18] <soultcer> Well we are a debian shop, so we have to use what debian gives us :D
[10:20] <omf_> You should use what is best and everything is open source. Hench benchmarks
[10:21] <omf_> Like deploying a multithreaded scraper for example has different bottlenecks
[10:23] <soultcer> I think the problem is that we don't have the manpower to do all that
[10:23] <omf_> most of it can be automated away
[10:23] <omf_> so not too much people power
[10:24] <omf_> We just need a slow strech
[10:24] <omf_> not fucking everything is dying at once bullshit
[10:26] <soultcer> You forget that Archiveteam is a hobby project. Even if there is no project dying right now, there is not always time for new features
[10:27] <omf_> I am talking about just my personal time there
[10:27] <soultcer> Oh, well then code away ;-)
[10:27] <soultcer> All the code we have is on github
[10:28] <omf_> I have been looking through it
[10:28] <soultcer> One thing which I think is important though is that more than one member needs to understand each project.
[10:29] <omf_> yes I agree
[10:29] <omf_> the hit by a bus problem
[10:29] <omf_> all the code is shared but is all the knowledge and process documented
[10:30] <soultcer> The warrior documentation is pretty good and we have many people who are Debian users who can help
[10:31] <soultcer> On the other hand we also have projects where the code is pretty much a mess and only one person knows how everything works
[10:31] <omf_> do we have good references for warc and cdx files
[10:31] <omf_> like I had to read the iso standard
[10:31] <omf_> and shit like that sucks the life out of you
[10:32] <soultcer> Well, warc is an ISO standard so that pretty much is the reference
[10:33] <soultcer> The Internet Archive can only add warc files to the wayback machine, not tar files made with wget, so we are stuck with that
[10:33] <omf_> it can handle the compressed warcs right?
[10:34] <soultcer> It's just gzip compression of the whole file I think
[10:35] <soultcer> But I never had to work with warc files, so in doubt ask alard or underscor ;-)
[10:45] <omf_> Most my current work is on url mapping, domain mapping and other things to make sure content coverage is good
[10:45] <omf_> all this can then be folded into preventative backups of key sites
[11:16] <alard> omf_: The most important part of the warrior is the ArchiveTeam/seesaw-kit repository. You should install that and use the run-warrior command to start the warrior. ArchiveTeam/warrior-preseed and ArchiveTeam/warrior-code2 are specific to the vm image. You could use these as inspiration for your own. (It's very useful to use the same Debian distribution, though, or you'll have to compile your own Wget+Lua binaries.)
[11:17] <alard> omf_: Benchmarks really depend on the project. Wget doesn't need a lot of memory, unless it finds a site with a lot of URLs, then no amount of memory is enough.
[11:18] <alard> The warrior VM uses 400MB of RAM, and that has generally worked so far. (And that's about the only benchmark we have.)
[11:19] <omf_> That is interesting to know
[11:21] <alard> omf_: The ISO standard for WARC isn't that bad, I think, and there's also the http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 to go with it.
[11:22] <alard> It's probably important to add that "compressed warcs" are compressed *per warc record*, so you can easily extract individual records. It's not just a gzipped file.
[11:24] <omf_> I am going to read that pdf alard
[13:07] <soultcer> alard: Is there a reason why you have build-essential installed on the warrior vm?
[13:10] <alard> Not that I know of. I started with a normal Debian installation and removed the things I thought could be missed.
[13:10] <alard> Is is very large?
[13:11] <alard> Perhaps I thought that compiling things was necessary, for pip installs, and this was "essential".
[13:11] <soultcer> Well it pulls in a gcc so it takes a little bit of space
[13:11] <soultcer> I am trying to get a list of all "warrior required" packages so that I can create an amazon ec2 image
[13:11] <alard> gcc is probably useful.
[14:42] <chazchaz> iirc, one of the dependencies of seesaw has a module that compiles a C or C++ accelerated version if you have the proper dev packages installed
[14:43] <chazchaz> soultcer
[14:43] <chazchaz> simplejson
[14:46] <soultcer> Thanks
[14:48] <chazchaz> I think it needs build-essential and some python dev package
[15:56] <db48x22> alard: btw, I pushed the rest of those changes to my pull request. want to merge them or should I just do it?
[16:50] <soultcer> chazchaz: I just set up the warrior without gcc and you are absolutely right, simplejson complains that it will be installed without setups
[16:50] <soultcer> But shouldn't the json module in python 2.6 work just as well/fast?
[16:50] <db48x22> it works just as well, and I doubt the speed difference is noticable
[16:51] <db48x22> all we need to do is parse the occasional request for more work
[16:56] <ersi> soultcer: No, each WARCRecord is gzipped. Not the whole file
[16:56] <ersi> So it's basicaly a bunch of gzip streams in one file
[16:57] <ersi> oh, I missed that alard filled that in already. Sorry.
[23:19] <omf_> Is this the best channel we have to talk about scrapers?