[12:59] hey how come the "hackernews" username doesn't have a warrior icon next to it?
[13:00] may've been set up standalone
[13:01] i know "hackernews" is the default username for the image that was posted there
[13:01] so lots of people are using it
[13:03] yeah, it is standalone https://gist.github.com/duggan/5226732
[15:40] I think I found the bug with the warrior AMI
[15:40] Alard added an auto-reboot if the seesaw version was out of date
[15:41] And I made it so that there would be a specific ec2 branch on the seesaw-kit repo
[15:41] So it was always out of date and always rebooting
[15:52] hah, ouch
[15:53] I had to make some changes to warrior-code2 so that it would work with ec2. E.g. it loads the config from ec2 userdata and so on
[15:58] Hey. How many URLs are actually being archived? What are the actual rate limits on each IP?
[15:59] (Referring to Yahoo)
[15:59] for yahoo messages?
[15:59] Yeah sorry. Just realised that this channel might be used for other things as well :)
[16:00] We don't know what kind of rate limiting yahoo uses, and we also don't know the number of URLs, as the warriors constantly report back more URLs that need to be crawled
[16:00] there is also a #burnthemessenger channel just for that
[16:01] there's a lot of channels actually lol
[16:01] Ah right. I see. A while ago, I needed to pull about 6 million pages from one of the APIs provided by Yahoo because they were starting to switch to a pay-for model, and I wanted to get everything before that happened. I just used 90 AWS instances and auto killed them every hour.
[16:02] Clever ;-)
[16:02] lol nice. sounds like what posterous is doing now :p
[16:02] No idea where I stand legally, but I figure if I refuse their current terms of service, and say that I am sticking to their previous terms, I'm in the clear.
[16:03] But that was a useful exercise anyway. Only cost a few dollars as well.
[16:04] How do you mean, that sounds like what posterous is doing?
(I know they are shutting down..)
[16:05] they're banning our ips every hour
[16:05] Yeah, they'll probably resort to banning entire ranges of AWS.
[16:06] maybe. so far they haven't. their ban list overflowed at least once and old ips started working in less than a day sometimes haha
[16:06] O lol. I've never heard of that before. I wonder at what point in their stack they were banning IPs then
[16:07] they've actually been fairly cooperative, but i'm still not sure we're going to make it in time
[16:10] Is there a prebuilt AMI for AWS for the Yahoo messages?
[16:16] so, uh, I have VMware, not VirtualBox. does anyone know anything about converting the VM from one to the other?
[16:16] I believe (at least recent versions of) vmware should support that image
[16:17] it will convert for you automatically.
[16:17] the only thing you have to do manually is move the second disk from 1:1 to 1:0.
[16:18] tobbez, sep332, ah, excellent
[16:18] 100 Mbps down, 5 up, here
[16:18] but yes, I know it's rate-limited
[16:19] Layke: I think alard had one
[16:19] Layke: yes.
[16:20] Layke: https://gist.github.com/lukegb/5228290 <-- your username goes in the userdata field
[16:22] alard: https://github.com/ArchiveTeam/warrior-preseed/commit/aa1429dd0f9150bd24ce5a0816712fd52d0fbcc6
[16:23] Nice lukegb
[16:24] soultcer: What's that?
[16:24] tobbez, sep332, ah, that _is_ easy. File|Open, change the file type filter, point it at the .ova, and voilà
[16:24] The script I have been using to create a complete Warrior AMI
[16:24] Ah, nice.
[16:25] Next step is to add user/password protection for the web interface
[16:25] hm, although it came up with an error and said something to the effect of "click Retry to try again with relaxed rules, but it might not work"
[16:25] IceKarma: The .ova doesn't work in at least some versions of VMware, if I remember correctly.
[16:25] I will have to figure out how websockets work first ;-)
[16:25] alard, I have 8.0.4
[16:26] soultcer: Or just make something that lets you create an SSH tunnel.
[16:26] alard: in Workstation 9 it does, but you have to hit retry and then change the 2nd HDD to be on 1:0 instead of 1:1
[16:26] I think http auth is easier to use than ssh tunnels, especially on Windows where you'd have to use putty
[16:27] lukegb, yeah, just did that, about to try booting it
[16:28] and away it goes! =D
[16:29] soultcer: Didn't we already have something with a password? Or was that just your suggestion?
[16:29] It was my suggestion
[16:29] I unfortunately haven't gotten around to implementing it yet, as for posterous I settled on creating an AMI that only contains seesaw-kit, not the full warrior
[16:30] I'd like to give props to the people who set up this VM: other than the thing with the import and then needing to change that disk's configuration, it worked flawlessly, and the management interface is really slick.
[16:30] excellent level of polish
[16:32] What would be the easiest way if I want to run the archiver outside a vm? Is the code in the warrior-code2 repo what I want?
[16:33] tobbez: No. You want the seesaw kit, pip install seesaw
[16:33] That gives you a run-pipeline command that you can use to run the pipeline.py scripts.
[16:33] tobbez: https://github.com/ArchiveTeam/yahoomessages-grab#running-without-a-warrior
[16:34] Have there been any plans to turn the archiver into a puppet module or similar?
[16:34] alard: Thanks
[16:35] Whoop: We prefer people running the warrior VM, so there's a common system. There are often dependencies, such as our modified version of Wget, that need to be compiled if you're not on exactly the same system.
[16:35] fair enough
[16:36] So if you really want to run it on your own, you should at least be able to set it up yourself.
[16:36] How can I check that everything is running?
I ran an AWS instance
[16:37] I see several wget-lua processes being kicked off regularly, but not sure how to check properly
[16:37] It was more to ease large scale deployments - that said, I wasn't aware there was wackiness such as modified wgets
[16:38] Whoop: Well, feel free to create your own puppet thing and share it.
[16:39] Okay, I manually ran run-pipeline --concurrent 2 /home/ubuntu/yahoomessages-grab/pipeline.py Layke and can see things working. That looks good enough :)
[16:43] alard: Where does it store the data? Relative to the current directory?
[16:44] tobbez: Yes, I think it makes a data/ subdirectory. But you should check run-pipeline --help, because I don't remember the details at the moment.
[16:44] alard: Didn't see anything in the --help output, that's why I asked
[16:45] Isn't there an option for the data directory?
[16:45] Not that I can see
[16:46] Ah, no, that's only in the run-warrior version (that's what's running on the warrior VM). So in that case I think it's always ./data/
[16:46] Alright, good
[17:14] question: what's the exact rate yahoo limits at? (and how much bandwidth is that?)
[17:19] Join #BurnTheMessenger for the Yahoo! Messages archival project
[18:44] Are there instructions for running Warrior without a VM?
[18:47] You can run the scripts stand-alone, yes.
[18:48] First and foremost, I recommend joining #BurnTheMessenger instead - since that's the Yahoo! Messages project channel
[18:49] I'm there too
[19:05] Hi all
[19:05] Gozer_: Hello.
[19:05] Got 10 micro instances pending in us-east but they have not started yet, and it's been 15 minutes
[19:07] Bid is at $0.003, I don't want to bump anyone else off and start a price war...
[19:18] How long 'till Yahoo decides it's a DDOS and blocks EC2 netblocks?
[19:19] Unlikely IMO
[19:19] Also, please join #BurnTheMessenger instead - since that's the Yahoo! Messages project channel.
[19:20] oops, sorry about that. (Got the wrong IRC link from news.ycombinator.com, I guess.)
[19:32] soultcer: Are you working on the password-protection thing? Or can I?
[19:32] alard: I have not started on it yet. Feel free to implement it yourself
[19:33] I'll have a go then. I thought a command-line option would probably be enough to start with?
[19:33] config.json would be nicer because then it can be set with the userdata from ec2
[19:34] But for me the difficult part is understanding how to add http auth, especially to the websocket stuff. Changing from command-line arg to config file is easy
[19:34] Perhaps we can make it a combined option: --http-username --http-password *or* a config.json value.
[21:03] love your work :) I wonder how long it'll be till they decide to cut Yahoo Groups
[21:32] soultcer: alard: I'm sort of tempted to add a single way of using the HTTP interface to control a whole set of warriors :P
[21:33] I think my stupidity overwhelmed alard's connection
[21:34] lukegb: There's an API ish
[21:59] Where can I find the repo for the code that makes up the Warrior web frontend?
[22:04] daxelrod: https://github.com/ArchiveTeam/seesaw-kit
[22:05] If I'm not mistaken, you want to look in seesaw/web.py
[22:05] Ohh, it's in seesaw, ok
[22:05] Thanks!
[22:05] Yeah, the warrior scripts basically fix the environment and update the project code
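Note on the combined option floated at [19:34]: a rough Python sketch of how "--http-username --http-password *or* a config.json value" might be resolved, with the command line taking precedence. This is only an illustration, not seesaw-kit code. The function name resolve_http_auth and the config.json keys http_username / http_password are made up for the example; only the two flag names come from the discussion above. Wiring the result into the web interface and the websocket stuff is not covered.

```python
import argparse
import json


def resolve_http_auth(argv, config_text=None):
    """Return (username, password), or None if no full credential pair is set.

    Hypothetical helper: command-line flags win, config.json fills the gaps.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--http-username")
    parser.add_argument("--http-password")
    # parse_known_args ignores any other options the real command might take.
    args, _unknown = parser.parse_known_args(argv)

    # config_text is the raw content of config.json (for example, taken from
    # ec2 userdata). The key names below are assumptions for this sketch.
    config = json.loads(config_text) if config_text else {}

    username = args.http_username or config.get("http_username")
    password = args.http_password or config.get("http_password")

    # Only enable authentication when both halves are present.
    if username and password:
        return (username, password)
    return None


if __name__ == "__main__":
    example_config = '{"http_username": "warrior", "http_password": "secret"}'
    print(resolve_http_auth(["--http-username", "layke"], example_config))
```

Here the username would come from the flag and the password from the config, so an ec2 userdata config.json could set defaults while a command-line flag still overrides them.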