#warrior 2013-03-23,Sat

Time Nickname Message
12:59 🔗 sep332 hey how come the "hackernews" username doesn't have a warrior icon next to it?
13:00 🔗 Cameron_D may've been set up standalone
13:01 🔗 sep332 i know "hackernews" is the default username for the image that was posted there
13:01 🔗 sep332 so lots of people are using it
13:03 🔗 Cameron_D yeah, it is standalone https://gist.github.com/duggan/5226732
15:40 🔗 soultcer I think I found the bug with the warrior AMI
15:40 🔗 soultcer Alard added an auto-reboot if the seesaw version was out of date
15:41 🔗 soultcer And I made it so that there would be a specific ec2 branch on the seesaw-kit repo
15:41 🔗 soultcer So it was always out of date and always rebooting
15:52 🔗 ersi hah, ouch
15:53 🔗 soultcer I had to make some changes to warrior-code2 so that it would work with ec2. E.g. it loads the config from ec2 userdata and so on
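
The reboot loop soultcer describes at 15:40-15:41 can be illustrated with a minimal sketch. The log does not show how the warrior actually compares versions, so the function names, the inequality check, and the version strings below are all hypothetical. The sketch only assumes that every boot installs seesaw from the same ec2 branch again, so the installed version never catches up with the required one.

```python
def needs_reboot(installed_version, required_version):
    """Hypothetical check: ask for a reboot whenever the versions differ."""
    return installed_version != required_version


def count_reboots(installed_version, required_version, max_boots=5):
    """Simulate up to max_boots boots and count how many end in a reboot.

    Assumption: each boot reinstalls from the same branch, so
    installed_version stays the same from one boot to the next.
    """
    reboots = 0
    for _ in range(max_boots):
        if not needs_reboot(installed_version, required_version):
            break
        reboots += 1
    return reboots
```

With a mismatched pair such as `count_reboots("ec2-branch", "0.0.12")` the simulation reboots on every boot up to the cap, while a matching pair stops immediately.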
15:58 🔗 Layke Hey. How many URLs are actually being archived? What are the actual rate limits on each IP?
15:59 🔗 Layke (Referring to Yahoo)
15:59 🔗 sep332 for yahoo messages?
15:59 🔗 Layke Yeah sorry. Just realised that this channel might be used for other things as well :)
16:00 🔗 soultcer We don't know what kind of rate limiting yahoo uses, and we also don't know the amount of URLs, as the warriors constantly report back more URLs that need to be crawled
16:00 🔗 sep332 there is also a #burnthemessenger channel just for that
16:01 🔗 sep332 there's a lot of channels actually lol
16:01 🔗 Layke Ah right. I see. A while ago, I needed to pull about 6 million pages from one of the APIs provided by Yahoo because they were starting to switch to a pay-for model, and I wanted to get everything before that happened. I just used 90 AWS instances and auto killed them every hour.
16:02 🔗 soultcer Clever ;-)
16:02 🔗 sep332 lol nice. sounds like what posterous is doing now :p
16:02 🔗 Layke No idea where I stand legally, but I figure if I refuse their current terms of service, and say that I am sticking to their previous terms I'm in the clear.
16:03 🔗 Layke But that was a useful exercise anyway. Only cost a few dollars as well.
16:04 🔗 Layke How do you mean, that sounds like what posterous is doing? (I know they are shutting down..)
16:05 🔗 sep332 they're banning our ips every hour
16:05 🔗 Layke Yeah, they'll probably resort to banning entire ranges of AWS.
16:06 🔗 sep332 maybe. so far they haven't. their ban list overflowed at least once and old ips started working in less than a day sometimes haha
16:06 🔗 Layke O lol. I've never heard of that before. I wonder what point in their stack they were banning IPs then
16:07 🔗 sep332 they've actually been fairly cooperative, but i'm still not sure we're going to make it in time
16:10 🔗 Layke Is there a prebuilt AMI for AWS for the Yahoo messages?
16:16 🔗 IceKarma so, uh, I have VMware, not VirtualBox. does anyone know anything about converting the VM from one to the other?
16:16 🔗 tobbez I believe vmware (at least recent versions) should support that image
16:17 🔗 sep332 it will convert for you automatically.
16:17 🔗 sep332 the only thing you have to do manually is move the second disk from 1:1 to 1:0.
16:18 🔗 IceKarma tobbez, sep332, ah, excellent
16:18 🔗 IceKarma 100 Mbps down, 5 up, here
16:18 🔗 IceKarma but yes, I know it's rate-limited
16:19 🔗 sep332 Layke: I think alard had one
16:19 🔗 lukegb Layke: yes.
16:20 🔗 lukegb Layke: https://gist.github.com/lukegb/5228290 <-- your username goes in the userdata field
16:22 🔗 soultcer alard: https://github.com/ArchiveTeam/warrior-preseed/commit/aa1429dd0f9150bd24ce5a0816712fd52d0fbcc6
16:23 🔗 Layke Nice lukegb
16:24 🔗 alard soultcer: What's that?
16:24 🔗 IceKarma tobbez, sep332, ah, that _is_ easy. File|Open, change the file type filter, point it at the .ova, and voilà
16:24 🔗 soultcer The script I have been using to create a complete Warrior AMI
16:24 🔗 alard Ah, nice.
16:25 🔗 soultcer Next step is to add user/password protection for the web interface
16:25 🔗 IceKarma hm, although it came up with an error and said something to the effect of "click Retry to try again with relaxed rules, but it might not work"
16:25 🔗 alard IceKarma: The .ova doesn't work in at least some versions of VMware, if I remember correctly.
16:25 🔗 soultcer I will have to figure out how websockets work first ;-)
16:25 🔗 IceKarma alard, I have 8.0.4
16:26 🔗 alard soultcer: Or just make something that lets you create an SSH tunnel.
16:26 🔗 lukegb alard: in Workstation 9 it does, but you have to hit retry and then change the 2nd HDD to be on 1:0 instead of 1:1
16:26 🔗 soultcer I think http auth is easier to use than ssh tunnels, especially on Windows where you'd have to use putty
16:27 🔗 IceKarma lukegb, yeah, just did that, about to try booting it
16:28 🔗 IceKarma and away it goes! =D
16:29 🔗 alard soultcer: Didn't we already have something with a password? Or was that just your suggestion?
16:29 🔗 soultcer It was my suggestion
16:29 🔗 soultcer I unfortunately haven't gotten around to implementing it yet, as for posterous I settled on creating an AMI that only contains seesaw-kit, not the full warrior
16:30 🔗 IceKarma I'd like to give props to the people who set up this VM: other than the thing with the import and then needing to change that disk's configuration, it worked flawlessly, and the management interface is really slick.
16:30 🔗 IceKarma excellent level of polish
16:32 🔗 tobbez What would be the easiest way if I want to run the archiver outside a vm? Is the code in the warrior-code2 repo what I want?
16:33 🔗 alard tobbez: No. You want the seesaw kit, pip install seesaw
16:33 🔗 alard That gives you a run-pipeline command that you can use to run the pipeline.py scripts.
16:33 🔗 alard tobbez: https://github.com/ArchiveTeam/yahoomessages-grab#running-without-a-warrior
16:34 🔗 Whoop Have there been any plans to turn the archiver into a puppet module or similar?
16:34 🔗 tobbez alard: Thanks
16:35 🔗 alard Whoop: We prefer people running the warrior VM, so there's a common system. There are often dependencies, such as our modified version of Wget, that need to be compiled if you're not on exactly the same system.
16:35 🔗 Whoop fair enough
16:36 🔗 alard So if you really want to run it on your own, you should at least be able to set it up yourself.
16:36 🔗 Layke How can I check that everything is running? I ran an AWS instance
16:37 🔗 Layke I see several wget-lua processes being kicked off regularly, but not sure how to check properly
16:37 🔗 Whoop It was more to ease large scale deployments - that said, I wasn't aware there was wackiness such as modified wgets
16:38 🔗 alard Whoop: Well, feel free to create your own puppet thing and share it.
16:39 🔗 Layke Okay, I manually ran run-pipeline --concurrent 2 /home/ubuntu/yahoomessages-grab/pipeline.py Layke and can see things working. That looks good enough :)
16:43 🔗 tobbez alard: Where does it store the data? Relative to the current directory?
16:44 🔗 alard tobbez: Yes, I think it makes a data/ subdirectory. But you should check run-pipeline --help , because I don't remember the details at the moment.
16:44 🔗 tobbez alard: Didn't see anything in the --help output, that's why I asked
16:45 🔗 alard Isn't there an option for the data directory?
16:45 🔗 tobbez Not that I can see
16:46 🔗 alard Ah, no, that's only in the run-warrior version (that's what's running on the warrior VM). So in that case I think it's always ./data/
16:46 🔗 tobbez Alright, good
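
The standalone invocation discussed above (16:33-16:46) can be wrapped in a short script. This is a sketch, assuming seesaw was installed with pip so that `run-pipeline` is on the PATH; the helper names are hypothetical, and the only flag used is `--concurrent`, as in Layke's command. Per alard's recollection the data should end up in `./data/` relative to the working directory, which is why the sketch lets the caller choose that directory.

```python
import subprocess


def build_run_pipeline_cmd(pipeline_path, nickname, concurrent=2):
    """Build the argument list for the command quoted in the log:
    run-pipeline --concurrent N <pipeline.py> <nickname>
    """
    return ["run-pipeline", "--concurrent", str(concurrent), pipeline_path, nickname]


def run_pipeline(pipeline_path, nickname, concurrent=2, workdir="."):
    """Run the pipeline from workdir and return the process exit code.

    Assumption: run-pipeline writes to ./data/ under the current
    directory, so workdir decides where the data lands.
    """
    cmd = build_run_pipeline_cmd(pipeline_path, nickname, concurrent)
    return subprocess.call(cmd, cwd=workdir)
```

For example, `run_pipeline("/home/ubuntu/yahoomessages-grab/pipeline.py", "Layke")` would start the same command Layke ran by hand.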
17:14 🔗 thomasbk question: what's the exact rate yahoo limits at? (and how much bandwidth is that?)
17:19 🔗 ersi Join #BurnTheMessenger for the Yahoo! Messages archival project
18:44 🔗 daxelrod Are there instructions for running Warrior without a VM?
18:47 🔗 ersi You can run the scripts stand-alone, yes.
18:48 🔗 ersi First and foremost, I recommend joining #BurnTheMessenger instead - since that's the Yahoo! Messages project channel
18:49 🔗 daxelrod I'm there too
19:05 🔗 Gozer_ Hi all
19:05 🔗 alard Gozer_: Hello.
19:05 🔗 Gozer_ Got 10 micro instances pending in us-east but they have not started yet it's been 15 minutes
19:07 🔗 Gozer_ Bid is at $0.003, I don't want to bump anyone else off and start a price war...
19:18 🔗 cascode How long 'till Yahoo decides it's a DDOS and blocks EC2 netblocks?
19:19 🔗 ersi Unlikely IMO
19:19 🔗 ersi Also, please join #BurnTheMessenger instead - since that's the Yahoo! Messages project channel.
19:20 🔗 cascode oops, sorry about that. (Got the wrong IRC link from news.ycombinator.com, I guess.)
19:32 🔗 alard soultcer: Are you working on the password-protection thing? Or can I?
19:32 🔗 soultcer alard: I have not started on it yet. Feel free to implement it yourself
19:33 🔗 alard I'll have a go then. I thought a command-line option would probably be enough to start with?
19:33 🔗 soultcer config.json would be nicer because then it can be set with the userdata from ec2
19:34 🔗 soultcer But for me the difficult part is understanding how to add http auth, especially to the websocket stuff. Changing from command-line arg to config file is easy
19:34 🔗 alard Perhaps we can make it a combined option: --http-username --http-password *or* a config.json value.
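
The combined option alard suggests could look roughly like the sketch below. It is not the seesaw-kit implementation: the flag names come from the log, but the config.json key names (`http_username`, `http_password`), the function names, and the precedence (command line first, then config) are assumptions. The second function shows one way to check a plain HTTP Basic `Authorization` header; it does not address the websocket part that soultcer mentions.

```python
import argparse
import base64
import hmac
import json


def resolve_credentials(argv, config_text=None):
    """Return (username, password) from the CLI flags, else from the
    config JSON text, else None when no protection is configured."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--http-username")
    parser.add_argument("--http-password")
    args, _unknown = parser.parse_known_args(argv)
    if args.http_username and args.http_password:
        return args.http_username, args.http_password
    if config_text:
        config = json.loads(config_text)
        # Hypothetical key names; the log does not specify them.
        user = config.get("http_username")
        password = config.get("http_password")
        if user and password:
            return user, password
    return None


def check_basic_auth(header_value, credentials):
    """Check an 'Authorization: Basic ...' header value against credentials."""
    if credentials is None:
        return True  # no username/password configured, allow everything
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header_value[len("Basic "):]).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    expected = "%s:%s" % credentials
    # Constant-time comparison to avoid leaking how much of the value matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))
```

Reading the values from config.json as a fallback keeps soultcer's use case open, since that file could be filled in from the ec2 userdata.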
21:03 🔗 dgsrgs962 love your work :) I wonder how long it'll be till they decide to cut Yahoo Groups
21:32 🔗 lukegb soultcer: alard: I'm sort of tempted to add a single way of using the HTTP interface to control a whole set of warriors :P
21:33 🔗 lukegb I think my stupidity overwhelmed alard's connection
21:34 🔗 ersi lukegb: There's an API ish
21:59 🔗 daxelrod Where can I find the repo for the code that makes up the Warrior web frontend?
22:04 🔗 ersi daxelrod: https://github.com/ArchiveTeam/seesaw-kit
22:05 🔗 ersi If I'm not mistaken, you want to look in seesaw/web.py
22:05 🔗 daxelrod Ohh, it's in seesaw, ok
22:05 🔗 daxelrod Thanks!
22:05 🔗 ersi Yeah, the warrior scripts basically fix the environment and update the project code
