#warrior 2013-02-24,Sun

Time Nickname Message
03:52 🔗 omf_ Is this really the most updated code for the warrior? https://github.com/ArchiveTeam/warrior-code
09:14 🔗 ersi omf_: Yes, do check if there's a development branch though
09:59 🔗 soultcer omf_, It's warrior-code2, not warrior-code
10:05 🔗 omf_ got it
10:05 🔗 omf_ I have built some deployable linux vms before. I am also interested in looking at how people do it.
10:06 🔗 soultcer Well the actual building of the vm happens in another git repo called warrior-preseed
10:06 🔗 soultcer It's basically just a customized debian install using the debian installer's preseeding options
10:06 🔗 soultcer I'm currently playing around with it because I want to slim it down a bit
10:10 🔗 omf_ yeah I am familiar with the preseed features. I used it to build custom ubuntu installs for a non-profit
10:10 🔗 omf_ It really sped up their process
10:11 🔗 omf_ One trick I use is to scrub stuff out on the end of build hook
10:13 🔗 soultcer There is a script in warrior-code2 that removes all unneeded files
10:13 🔗 soultcer And some aptitude purge magic that deletes unwanted packages in the preseed file itself
10:13 🔗 omf_ Have you done any benchmarking
10:13 🔗 omf_ I know for size you have
10:14 🔗 omf_ what about RAM usage and boot speed
10:14 🔗 soultcer I don't think anyone has done such benchmarks, but you would have to ask alard to be sure. He is the genius that came up with the warrior
10:15 🔗 omf_ kiwi and cobbler are pretty advanced in those kinds of things
10:16 🔗 soultcer I never tried these, but my next step after cleaning up the package lists is to write a plugin for this: https://github.com/andsens/ec2debian-build-ami
10:16 🔗 soultcer Then we could have a warrior AMI that anyone can just run
10:17 🔗 omf_ kiwi is the opensuse distro builder, cobbler is the redhat distro builder
10:17 🔗 omf_ they both build to multiple formats by default including EC2 instances
10:18 🔗 omf_ the opensuse ones can be built and deployed from just the web
10:18 🔗 soultcer Well we are a debian shop, so we have to use what debian gives us :D
10:20 🔗 omf_ You should use what is best and everything is open source. Hence benchmarks
10:21 🔗 omf_ Like deploying a multithreaded scraper for example has different bottlenecks
10:23 🔗 soultcer I think the problem is that we don't have the manpower to do all that
10:23 🔗 omf_ most of it can be automated away
10:23 🔗 omf_ so not too much people power
10:24 🔗 omf_ We just need a slow stretch
10:24 🔗 omf_ not fucking everything is dying at once bullshit
10:26 🔗 soultcer You forget that Archiveteam is a hobby project. Even if there is no project dying right now, there is not always time for new features
10:27 🔗 omf_ I am talking about just my personal time there
10:27 🔗 soultcer Oh, well then code away ;-)
10:27 🔗 soultcer All the code we have is on github
10:28 🔗 omf_ I have been looking through it
10:28 🔗 soultcer One thing which I think is important though is that more than one member needs to understand each project.
10:29 🔗 omf_ yes I agree
10:29 🔗 omf_ the hit by a bus problem
10:29 🔗 omf_ all the code is shared but is all the knowledge and process documented
10:30 🔗 soultcer The warrior documentation is pretty good and we have many people who are Debian users who can help
10:31 🔗 soultcer On the other hand we also have projects where the code is pretty much a mess and only one person knows how everything works
10:31 🔗 omf_ do we have good references for warc and cdx files
10:31 🔗 omf_ like I had to read the iso standard
10:31 🔗 omf_ and shit like that sucks the life out of you
10:32 🔗 soultcer Well, warc is an ISO standard so that pretty much is the reference
10:33 🔗 soultcer The Internet Archive can only add warc files to the wayback machine, not tar files made with wget, so we are stuck with that
10:33 🔗 omf_ it can handle the compressed warcs right?
10:34 🔗 soultcer It's just gzip compression of the whole file I think
10:35 🔗 soultcer But I never had to work with warc files, so in doubt ask alard or underscor ;-)
10:45 🔗 omf_ Most of my current work is on url mapping, domain mapping and other things to make sure content coverage is good
10:45 🔗 omf_ all this can then be folded into preventative backups of key sites
11:16 🔗 alard omf_: The most important part of the warrior is the ArchiveTeam/seesaw-kit repository. You should install that and use the run-warrior command to start the warrior. ArchiveTeam/warrior-preseed and ArchiveTeam/warrior-code2 are specific to the vm image. You could use these as inspiration for your own. (It's very useful to use the same Debian distribution, though, or you'll have to compile your own Wget+Lua binaries.)
11:17 🔗 alard omf_: Benchmarks really depend on the project. Wget doesn't need a lot of memory, unless it finds a site with a lot of URLs, then no amount of memory is enough.
11:18 🔗 alard The warrior VM uses 400MB of RAM, and that has generally worked so far. (And that's about the only benchmark we have.)
11:19 🔗 omf_ That is interesting to know
11:21 🔗 alard omf_: The ISO standard for WARC isn't that bad, I think, and there's also the http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 to go with it.
11:22 🔗 alard It's probably important to add that "compressed warcs" are compressed *per warc record*, so you can easily extract individual records. It's not just a gzipped file.
11:24 🔗 omf_ I am going to read that pdf alard
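
[Editor's sketch, not part of the log] alard's point about per-record compression can be illustrated with plain gzip members. This is a minimal sketch under assumptions: the byte strings are hypothetical stand-ins for WARC records (not real WARC syntax), and `read_record_at` is an illustrative helper, not part of any WARC library.

```python
import gzip
import zlib

# Hypothetical stand-ins for WARC records (not real WARC syntax).
records = [b"record-one\n", b"record-two\n", b"record-three\n"]

# Compress each record as its own gzip member and remember where it starts.
offsets = []
blob = b""
for record in records:
    offsets.append(len(blob))
    blob += gzip.compress(record)

def read_record_at(data, offset):
    """Decompress exactly one gzip member starting at the given byte offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the member; the bytes that follow
    # are left in decompressor.unused_data.
    return decompressor.decompress(data[offset:])

# Random access: only the second member is decompressed.
second = read_record_at(blob, offsets[1])

# The concatenation is still a valid gzip file for ordinary tools.
everything = gzip.decompress(blob)
```

Because every member is a complete gzip stream, an index of byte offsets is enough to pull out a single record without reading the rest of the file.
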
13:07 🔗 soultcer alard: Is there a reason why you have build-essential installed on the warrior vm?
13:10 🔗 alard Not that I know of. I started with a normal Debian installation and removed the things I thought could be missed.
13:10 🔗 alard Is it very large?
13:11 🔗 alard Perhaps I thought that compiling things was necessary, for pip installs, and this was "essential".
13:11 🔗 soultcer Well it pulls in a gcc so it takes a little bit of space
13:11 🔗 soultcer I am trying to get a list of all "warrior required" packages so that I can create an amazon ec2 image
13:11 🔗 alard gcc is probably useful.
14:42 🔗 chazchaz iirc, one of the dependencies of seesaw has a module that compiles a C or C++ accelerated version if you have the proper dev packages installed
14:43 🔗 chazchaz soultcer
14:43 🔗 chazchaz simplejson
14:46 🔗 soultcer Thanks
14:48 🔗 chazchaz I think it needs build-essential and some python dev package
15:56 🔗 db48x22 alard: btw, I pushed the rest of those changes to my pull request. want to merge them or should I just do it?
16:50 🔗 soultcer chazchaz: I just set up the warrior without gcc and you are absolutely right, simplejson complains that it will be installed without speedups
16:50 🔗 soultcer But shouldn't the json module in python 2.6 work just as well/fast?
16:50 🔗 db48x22 it works just as well, and I doubt the speed difference is noticeable
16:51 🔗 db48x22 all we need to do is parse the occasional request for more work
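
[Editor's sketch, not part of the log] The simplejson versus standard-library json question above is often handled with a fallback import. This is a generic pattern under assumptions, not a description of how seesaw imports its JSON module, and the payload below is a hypothetical example rather than a real tracker response.

```python
try:
    # Optional third-party package; its C speedups need a compiler and
    # Python development headers at install time.
    import simplejson as json
except ImportError:
    # Standard library module, available since Python 2.6.
    import json

# Hypothetical payload standing in for an occasional "more work" response.
raw = '{"item": "example-item", "count": 3}'
message = json.loads(raw)
```

Both modules expose the same `loads` and `dumps` calls, so code written this way runs with or without build-essential installed.
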
16:56 🔗 ersi soultcer: No, each WARCRecord is gzipped. Not the whole file
16:56 🔗 ersi So it's basically a bunch of gzip streams in one file
16:57 🔗 ersi oh, I missed that alard filled that in already. Sorry.
23:19 🔗 omf_ Is this the best channel we have to talk about scrapers?
