[12:59] <sep332> hey how come the "hackernews" username doesn't have a warrior icon next to it?
[13:00] <Cameron_D> may've been set up standalone
[13:01] <sep332> i know "hackernews" is the default username for the image that was posted there
[13:01] <sep332> so lots of people are using it
[13:03] <Cameron_D> yeah, it is standalone https://gist.github.com/duggan/5226732
[15:40] <soultcer> I think I found the bug with the warrior AMI
[15:40] <soultcer> Alard added an auto-reboot if the seesaw version was out of date
[15:41] <soultcer> And I made it so that there would be a specific ec2 branch on the seesaw-kit repo
[15:41] <soultcer> So it was always out of date and always rebooting
[15:52] <ersi> hah, ouch
[15:53] <soultcer> I had to make some changes to warrior-code2 so that it would work with ec2. E.g. it loads the config from ec2 userdata and so on
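
The reboot loop soultcer describes above can be sketched roughly as follows. This is only an illustration, assuming the check is a plain comparison between the installed seesaw version and the version the warrior expects; the function names and version strings are hypothetical and not taken from warrior-code2.

```python
def needs_reboot(installed: str, expected: str) -> bool:
    """Hypothetical check: reboot whenever the installed version is not the expected one."""
    return installed != expected


def boots_until_stable(installed: str, expected: str, max_boots: int = 5) -> int:
    """Count boots until the check passes, capped so the simulation terminates."""
    boots = 1
    while needs_reboot(installed, expected) and boots < max_boots:
        # Each reboot reinstalls from the same pinned branch,
        # so `installed` never changes and the check keeps failing.
        boots += 1
    return boots


# Made-up version strings: a pinned ec2 branch that never matches the expected version.
print(boots_until_stable("0.0.10-ec2", "0.0.12"))  # hits the cap
print(boots_until_stable("0.0.12", "0.0.12"))      # stable on first boot
```

If the pinned branch can never match the expected version, the only exits are to compare against the branch actually in use or to skip the check on that image.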
[15:58] <Layke> Hey. How many URLs are actually being archived? What are the actual rate limits on each IP?
[15:59] <Layke> (Referring to Yahoo)
[15:59] <sep332> for yahoo messages?
[15:59] <Layke> Yeah sorry. Just realised that this channel might be used for other things as well :)
[16:00] <soultcer> We don't know what kind of rate limiting yahoo uses, and we also don't know the amount of URLs, as the warriors constantly report back more URLs that need to be crawled
[16:00] <sep332> there is also a #burnthemessenger channel just for that
[16:01] <sep332> there's a lot of channels actually lol
[16:01] <Layke> Ah right. I see. A while ago, I needed to pull about 6 million pages from one of the APIs provided by Yahoo because they were starting to switch to a pay for model, and I wanted to get everything before that happened. I just used 90 AWS instances and auto killed them every hour.
[16:02] <soultcer> Clever ;-)
[16:02] <sep332> lol nice. sounds like what posterous is doing now :p
[16:02] <Layke> No idea where I stand legally, but I figure if I refuse their current terms of service, and say that I am sticking to their previous terms I'm in the clear.
[16:03] <Layke> But that was a useful exercise anyway. Only cost a few dollars as well.
[16:04] <Layke> How do you mean, that sounds like what posterous is doing? (I know they are shutting down..)
[16:05] <sep332> they're banning our ips every hour
[16:05] <Layke> Yeah, they'll probably resort to banning entire AWS ranges.
[16:06] <sep332> maybe. so far they haven't. their ban list overflowed at least once and old ips started working in less than a day sometimes haha
[16:06] <Layke> O lol. I've never heard of that before. I wonder what point in their stack they were banning IPs then
[16:07] <sep332> they've actually been fairly cooperative, but i'm still not sure we're going to make it in time
[16:10] <Layke> Is there a prebuilt AMI for AWS for the Yahoo messages?
[16:16] <IceKarma> so, uh, I have VMware, not VirtualBox. does anyone know anything about converting the VM from one to the other?
[16:16] <tobbez> I believe vmware (at least recent versions) should support that image
[16:17] <sep332> it will convert for you automatically.
[16:17] <sep332> the only thing you have to do manually is move the second disk from 1:1 to 1:0.
[16:18] <IceKarma> tobbez, sep332, ah, excellent
[16:18] <IceKarma> 100 Mbps down, 5 up, here
[16:18] <IceKarma> but yes, I know it's rate-limited
[16:19] <sep332> Layke: I think alard had one
[16:19] <lukegb> Layke: yes.
[16:20] <lukegb> Layke: https://gist.github.com/lukegb/5228290 <-- your username goes in the userdata field
[16:22] <soultcer> alard: https://github.com/ArchiveTeam/warrior-preseed/commit/aa1429dd0f9150bd24ce5a0816712fd52d0fbcc6
[16:23] <Layke> Nice lukegb
[16:24] <alard> soultcer: What's that?
[16:24] <IceKarma> tobbez, sep332, ah, that _is_ easy. File|Open, change the file type filter, point it at the .ova, and voilà
[16:24] <soultcer> The script I have been using to create a complete Warrior AMI
[16:24] <alard> Ah, nice.
[16:25] <soultcer> Next step is to add user/password protection for the web interface
[16:25] <IceKarma> hm, although it came up with an error and said something to the effect of "click Retry to try again with relaxed rules, but it might not work"
[16:25] <alard> IceKarma: The .ova doesn't work in at least some versions of VMware, if I remember correctly.
[16:25] <soultcer> I will have to figure out how websockets work first ;-)
[16:25] <IceKarma> alard, I have 8.0.4
[16:26] <alard> soultcer: Or just make something that lets you create an SSH tunnel.
[16:26] <lukegb> alard: in Workstation 9 it does, but you have to hit retry and then change the 2nd HDD to be on 1:0 instead of 1:1
[16:26] <soultcer> I think http auth is easier to use than ssh tunnels, especially on Windows where you'd have to use putty
[16:27] <IceKarma> lukegb, yeah, just did that, about to try booting it
[16:28] <IceKarma> and away it goes! =D
[16:29] <alard> soultcer: Didn't we already have something with a password? Or was that just your suggestion?
[16:29] <soultcer> It was my suggestion
[16:29] <soultcer> I unfortunately haven't gotten around to implementing it yet, as for posterous I settled on creating an AMI that only contains seesaw-kit, not the full warrior
[16:30] <IceKarma> I'd like to give props to the people who set up this VM: other than the thing with the import and then needing to change that disk's configuration, it worked flawlessly, and the management interface is really slick.
[16:30] <IceKarma> excellent level of polish
[16:32] <tobbez> What would be the easiest way if I want to run the archiver outside a vm? Is the code in the warrior-code2 repo what I want?
[16:33] <alard> tobbez: No. You want the seesaw kit, pip install seesaw
[16:33] <alard> That gives you a run-pipeline command that you can use to run the pipeline.py scripts.
[16:33] <alard> tobbez: https://github.com/ArchiveTeam/yahoomessages-grab#running-without-a-warrior
[16:34] <Whoop> Have there been any plans to turn the archiver into a puppet module or similar?
[16:34] <tobbez> alard: Thanks
[16:35] <alard> Whoop: We prefer people running the warrior VM, so there's a common system. There are often dependencies, such as our modified version of Wget, that need to be compiled if you're not on exactly the same system.
[16:35] <Whoop> fair enough
[16:36] <alard> So if you really want to run it on your own, you should at least be able to set it up yourself.
[16:36] <Layke> How can I check that everything is running? I ran an AWS instance
[16:37] <Layke> I see several wget-lua processes being kicked off regularly, but not sure how to check properly
[16:37] <Whoop> It was more to ease large scale deployments - that said, I wasn't aware there was wackiness such as modified wgets
[16:38] <alard> Whoop: Well, feel free to create your own puppet thing and share it.
[16:39] <Layke> Okay, I manually ran run-pipeline --concurrent 2 /home/ubuntu/yahoomessages-grab/pipeline.py Layke and can see things working. That looks good enough :)
[16:43] <tobbez> alard: Where does it store the data? Relative to the current directory?
[16:44] <alard> tobbez: Yes, I think it makes a data/ subdirectory. But you should check  run-pipeline --help , because I don't remember the details at the moment.
[16:44] <tobbez> alard: Didn't see anything in the --help output, that's why I asked
[16:45] <alard> Isn't there an option for the data directory?
[16:45] <tobbez> Not that I can see
[16:46] <alard> Ah, no, that's only in the run-warrior version (that's what's running on the warrior VM). So in that case I think it's always ./data/
[16:46] <tobbez> Alright, good
[17:14] <thomasbk> question: what's the exact rate yahoo limits at? (and how much bandwidth is that?)
[17:19] <ersi> Join #BurnTheMessenger for the Yahoo! Messages archival project
[18:44] <daxelrod> Are there instructions for running Warrior without a VM?
[18:47] <ersi> You can run the scripts stand-alone, yes.
[18:48] <ersi> First and foremost, I recommend joining #BurnTheMessenger instead - since that's the Yahoo! Messages project channel
[18:49] <daxelrod> I'm there too
[19:05] <Gozer_> Hi all
[19:05] <alard> Gozer_: Hello.
[19:05] <Gozer_> Got 10 micro instances pending in us-east, but they have not started yet; it's been 15 minutes
[19:07] <Gozer_> Bid is at $0.003, I don't want to bump anyone else off and start a price war...
[19:18] <cascode> How long 'till Yahoo decides it's a DDOS and blocks EC2 netblocks?
[19:19] <ersi> Unlikely IMO
[19:19] <ersi> Also, please join #BurnTheMessenger instead - since that's the Yahoo! Messages project channel.
[19:20] <cascode> oops, sorry about that.  (Got the wrong IRC link from news.ycombinator.com, I guess.)
[19:32] <alard> soultcer: Are you working on the password-protection thing? Or can I?
[19:32] <soultcer> alard: I have not started on it yet. Feel free to implement it yourself
[19:33] <alard> I'll have a go then. I thought a command-line option would probably be enough to start with?
[19:33] <soultcer> config.json would be nicer because then it can be set with the userdata from ec2
[19:34] <soultcer> But for me the difficult part is understanding how to add http auth, especially to the websocket stuff. Changing from command-line arg to config file is easy
[19:34] <alard> Perhaps we can make it a combined option: --http-username --http-password *or* a config.json value.
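
A minimal sketch of the combined option alard suggests above, assuming the command-line flags take precedence over config.json. The flag names come from the chat; the config keys `http_username` and `http_password`, the function name, and the precedence rule are assumptions for illustration, not the actual seesaw-kit implementation.

```python
import argparse
import json
from typing import List, Optional, Tuple


def resolve_http_auth(argv: List[str], config_text: str = "{}") -> Optional[Tuple[str, str]]:
    """Return (username, password) from CLI flags or a config.json string, else None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--http-username")
    parser.add_argument("--http-password")
    args = parser.parse_args(argv)

    # Hypothetical config keys; on ec2 this text could come from the userdata.
    config = json.loads(config_text)
    username = args.http_username or config.get("http_username")
    password = args.http_password or config.get("http_password")

    # Only enable auth when both halves are present.
    if username and password:
        return (username, password)
    return None


config_json = '{"http_username": "warrior", "http_password": "secret"}'
print(resolve_http_auth([], config_json))
print(resolve_http_auth(["--http-username", "cli", "--http-password", "pw"], config_json))
```

Letting the flags override the file keeps the command-line route usable for local testing while ec2 instances rely on the config alone.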
[21:03] <dgsrgs962> love your work :) I wonder how long it'll be till they decide to cut Yahoo Groups
[21:32] <lukegb> soultcer: alard: I'm sort of tempted to add a single way of using the HTTP interface to control a whole set of warriors :P
[21:33] <lukegb> I think my stupidity overwhelmed alard's connection
[21:34] <ersi> lukegb: There's an API ish
[21:59] <daxelrod> Where can I find the repo for the code that makes up the Warrior web frontend?
[22:04] <ersi> daxelrod: https://github.com/ArchiveTeam/seesaw-kit
[22:05] <ersi> If I'm not mistaken, you want to look in seesaw/web.py
[22:05] <daxelrod> Ohh, it's in seesaw, ok
[22:05] <daxelrod> Thanks!
[22:05] <ersi> Yeah, the warrior scripts basically fix the environment and update the project code