Time |
Nickname |
Message |
12:59
🔗
|
sep332 |
hey how come the "hackernews" username doesn't have a warrior icon next to it? |
13:00
🔗
|
Cameron_D |
may've been set up standalone |
13:01
🔗
|
sep332 |
i know "hackernews" is the default username for the image that was posted there |
13:01
🔗
|
sep332 |
so lots of people are using it |
13:03
🔗
|
Cameron_D |
yeah, it is standalone https://gist.github.com/duggan/5226732 |
15:40
🔗
|
soultcer |
I think I found the bug with the warrior AMI |
15:40
🔗
|
soultcer |
Alard added an auto-reboot if the seesaw version was out of date |
15:41
🔗
|
soultcer |
And I made it so that there would be a specific ec2 branch on the seesaw-kit repo |
15:41
🔗
|
soultcer |
So it was always out of date and always rebooting |
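A minimal sketch of the reboot loop soultcer describes, assuming the check is a plain comparison of version strings. The function names and the version strings are hypothetical and are not taken from warrior-code2 or seesaw-kit.

```python
def needs_reboot(installed_version: str, reference_version: str) -> bool:
    """Return True when the installed seesaw version differs from the reference."""
    return installed_version != reference_version


def simulate_boots(installed_version: str, reference_version: str,
                   max_boots: int = 5) -> int:
    """Count reboots until the versions agree, giving up after max_boots."""
    reboots = 0
    for _ in range(max_boots):
        if not needs_reboot(installed_version, reference_version):
            break
        # Assumption: each boot reinstalls from the same ec2 branch, so the
        # installed version never catches up with the reference version.
        reboots += 1
    return reboots


if __name__ == "__main__":
    # Hypothetical version labels: the ec2 branch never matches the reference.
    print(simulate_boots("ec2-branch-build", "master-build"))
```

If both sides were compared against the same branch, the loop would stop on the first boot.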
15:52
🔗
|
ersi |
hah, ouch |
15:53
🔗
|
soultcer |
I had to make some changes to warrior-code2 so that it would work with ec2. E.g. it loads the config from ec2 userdata and so on |
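A rough sketch of loading configuration from EC2 userdata. The metadata URL is the standard EC2 instance metadata endpoint. The JSON format and the `downloader` key are assumptions for illustration; the log does not say how warrior-code2 actually encodes its config.

```python
import json
from urllib.request import urlopen

# Standard EC2 instance metadata endpoint for user data.
USERDATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_userdata(url: str = USERDATA_URL, timeout: float = 2.0) -> str:
    """Fetch the raw userdata string; this only works from inside an instance."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_config(userdata: str) -> dict:
    """Parse userdata as a JSON object (assumed format, not confirmed)."""
    config = json.loads(userdata)
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object in userdata")
    return config


if __name__ == "__main__":
    # Hypothetical payload; on a real instance you would call fetch_userdata().
    print(parse_config('{"downloader": "example-nick"}'))
```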
15:58
🔗
|
Layke |
Hey. How many URLs are actually being archived? What are the actual rate limits on each IP? |
15:59
🔗
|
Layke |
(Referring to Yahoo) |
15:59
🔗
|
sep332 |
for yahoo messages? |
15:59
🔗
|
Layke |
Yeah sorry. Just realised that this channel might be used for other things as well :) |
16:00
🔗
|
soultcer |
We don't know what kind of rate limiting yahoo uses, and we also don't know the amount of URLs, as the warriors constantly report back more URLs that need to be crawled |
16:00
🔗
|
sep332 |
there is also a #burnthemessenger channel just for that |
16:01
🔗
|
sep332 |
there's a lot of channels actually lol |
16:01
🔗
|
Layke |
Ah right. I see. A while ago, I needed to pull about 6 million pages from one of the APIs provided by Yahoo because they were starting to switch to a pay for model, and I wanted to get everything before that happened. I just used 90 AWS instances and auto killed them every hour. |
16:02
🔗
|
soultcer |
Clever ;-) |
16:02
🔗
|
sep332 |
lol nice. sounds like what posterous is doing now :p |
16:02
🔗
|
Layke |
No idea where I stand legally, but I figure if I refuse their current terms of service, and say that I am sticking to their previous terms I'm in the clear. |
16:03
🔗
|
Layke |
But that was a useful exercise anyway. Only cost a few dollars as well. |
16:04
🔗
|
Layke |
How do you mean, that sounds like what posterous is doing? (I know they are shutting down..) |
16:05
🔗
|
sep332 |
they're banning our ips every hour |
16:05
🔗
|
Layke |
Yeah, they'll probably resort to banning entire ranges of AWS. |
16:06
🔗
|
sep332 |
maybe. so far they haven't. their ban list overflowed at least once and old ips started working in less than a day sometimes haha |
16:06
🔗
|
Layke |
O lol. I've never heard of that before. I wonder what point in their stack they were banning IPs then |
16:07
🔗
|
sep332 |
they've actually been fairly cooperative, but i'm still not sure we're going to make it in time |
16:10
🔗
|
Layke |
Is there a prebuilt AMI for AWS for the Yahoo messages? |
16:16
🔗
|
IceKarma |
so, uh, I have VMware, not VirtualBox. does anyone know anything about converting the VM from one to the other? |
16:16
🔗
|
tobbez |
I believe (at least recent versions of) VMware should support that image |
16:17
🔗
|
sep332 |
it will convert for you automatically. |
16:17
🔗
|
sep332 |
the only thing you have to do manually is move the second disk from 1:1 to 1:0. |
16:18
🔗
|
IceKarma |
tobbez, sep332, ah, excellent |
16:18
🔗
|
IceKarma |
100 Mbps down, 5 up, here |
16:18
🔗
|
IceKarma |
but yes, I know it's rate-limited |
16:19
🔗
|
sep332 |
Layke: I think alard had one |
16:19
🔗
|
lukegb |
Layke: yes. |
16:20
🔗
|
lukegb |
Layke: https://gist.github.com/lukegb/5228290 <-- your username goes in the userdata field |
16:22
🔗
|
soultcer |
alard: https://github.com/ArchiveTeam/warrior-preseed/commit/aa1429dd0f9150bd24ce5a0816712fd52d0fbcc6 |
16:23
🔗
|
Layke |
Nice lukegb |
16:24
🔗
|
alard |
soultcer: What's that? |
16:24
🔗
|
IceKarma |
tobbez, sep332, ah, that _is_ easy. File|Open, change the file type filter, point it at the .ova, and voilà |
16:24
🔗
|
soultcer |
The script I have been using to create a complete Warrior AMI |
16:24
🔗
|
alard |
Ah, nice. |
16:25
🔗
|
soultcer |
Next step is to add user/password protection for the web interface |
16:25
🔗
|
IceKarma |
hm, although it came up with an error and said something to the effect of "click Retry to try again with relaxed rules, but it might not work" |
16:25
🔗
|
alard |
IceKarma: The .ova doesn't work in at least some versions of VMware, if I remember correctly. |
16:25
🔗
|
soultcer |
I will have to figure out how websockets work first ;-) |
16:25
🔗
|
IceKarma |
alard, I have 8.0.4 |
16:26
🔗
|
alard |
soultcer: Or just make something that lets you create an SSH tunnel. |
16:26
🔗
|
lukegb |
alard: in Workstation 9 it does, but you have to hit retry and then change the 2nd HDD to be on 1:0 instead of 1:1 |
16:26
🔗
|
soultcer |
I think http auth is easier to use than ssh tunnels, especially on Windows where you'd have to use putty |
16:27
🔗
|
IceKarma |
lukegb, yeah, just did that, about to try booting it |
16:28
🔗
|
IceKarma |
and away it goes! =D |
16:29
🔗
|
alard |
soultcer: Didn't we already have something with a password? Or was that just your suggestion? |
16:29
🔗
|
soultcer |
It was my suggestion |
16:29
🔗
|
soultcer |
I unfortunately haven't gotten around to implementing it yet, as for posterous I settled on creating an AMI that only contains seesaw-kit, not the full warrior |
16:30
🔗
|
IceKarma |
I'd like to give props to the people who set up this VM: other than the thing with the import and then needing to change that disk's configuration, it worked flawlessly, and the management interface is really slick. |
16:30
🔗
|
IceKarma |
excellent level of polish |
16:32
🔗
|
tobbez |
What would be the easiest way if I want to run the archiver outside a vm? Is the code in the warrior-code2 repo what I want? |
16:33
🔗
|
alard |
tobbez: No. You want the seesaw kit, pip install seesaw |
16:33
🔗
|
alard |
That gives you a run-pipeline command that you can use to run the pipeline.py scripts. |
16:33
🔗
|
alard |
tobbez: https://github.com/ArchiveTeam/yahoomessages-grab#running-without-a-warrior |
16:34
🔗
|
Whoop |
Have there been any plans to turn the archiver into a puppet module or similar? |
16:34
🔗
|
tobbez |
alard: Thanks |
16:35
🔗
|
alard |
Whoop: We prefer people running the warrior VM, so there's a common system. There are often dependencies, such as our modified version of Wget, that need to be compiled if you're not on exactly the same system. |
16:35
🔗
|
Whoop |
fair enough |
16:36
🔗
|
alard |
So if you really want to run it on your own, you should at least be able to set it up yourself. |
16:36
🔗
|
Layke |
How can I check that everything is running? I ran an AWS instance |
16:37
🔗
|
Layke |
I see several wget-lua processes being kicked off regularly, but not sure how to check properly |
16:37
🔗
|
Whoop |
It was more to ease large scale deployments - that said, I wasn't aware there was wackiness such as modified wgets |
16:38
🔗
|
alard |
Whoop: Well, feel free to create your own puppet thing and share it. |
16:39
🔗
|
Layke |
Okay, I manually ran run-pipeline --concurrent 2 /home/ubuntu/yahoomessages-grab/pipeline.py Layke and can see things working. That looks good enough :) |
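For anyone scripting this, a small sketch that builds the same run-pipeline invocation Layke typed by hand. The command name, the `--concurrent` flag, and the argument order come from the message above; the helper functions are hypothetical wrappers, not part of seesaw.

```python
import subprocess


def build_command(pipeline_path: str, nickname: str, concurrent: int = 2) -> list:
    """Build the argument list: run-pipeline --concurrent N PIPELINE NICKNAME."""
    return ["run-pipeline", "--concurrent", str(concurrent), pipeline_path, nickname]


def run_pipeline(pipeline_path: str, nickname: str, concurrent: int = 2) -> int:
    """Run the pipeline and return its exit code (requires `pip install seesaw`)."""
    return subprocess.call(build_command(pipeline_path, nickname, concurrent))


if __name__ == "__main__":
    # Print the command rather than running it, so this is safe to try anywhere.
    print(" ".join(build_command(
        "/home/ubuntu/yahoomessages-grab/pipeline.py", "Layke")))
```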
16:43
🔗
|
tobbez |
alard: Where does it store the data? Relative to the current directory? |
16:44
🔗
|
alard |
tobbez: Yes, I think it makes a data/ subdirectory. But you should check run-pipeline --help , because I don't remember the details at the moment. |
16:44
🔗
|
tobbez |
alard: Didn't see anything in the --help output, that's why I asked |
16:45
🔗
|
alard |
Isn't there an option for the data directory? |
16:45
🔗
|
tobbez |
Not that I can see |
16:46
🔗
|
alard |
Ah, no, that's only in the run-warrior version (that's what's running on the warrior VM). So in that case I think it's always ./data/ |
16:46
🔗
|
tobbez |
Alright, good |
17:14
🔗
|
thomasbk |
question: what's the exact rate yahoo limits at? (and how much bandwidth is that?) |
17:19
🔗
|
ersi |
Join #BurnTheMessenger for the Yahoo! Messages archival project |
18:44
🔗
|
daxelrod |
Are there instructions for running Warrior without a VM? |
18:47
🔗
|
ersi |
You can run the scripts stand-alone, yes. |
18:48
🔗
|
ersi |
First and foremost, I recommend joining #BurnTheMessenger instead - since that's the Yahoo! Messages project channel |
18:49
🔗
|
daxelrod |
I'm there too |
19:05
🔗
|
Gozer_ |
Hi all |
19:05
🔗
|
alard |
Gozer_: Hello. |
19:05
🔗
|
Gozer_ |
Got 10 micro instances pending in us-east, but they have not started yet and it's been 15 minutes |
19:07
🔗
|
Gozer_ |
Bid is at $0.003, I don't want to bump anyone else off and start a price war... |
19:18
🔗
|
cascode |
How long 'till Yahoo decides it's a DDOS and blocks EC2 netblocks? |
19:19
🔗
|
ersi |
Unlikely IMO |
19:19
🔗
|
ersi |
Also, please join #BurnTheMessenger instead - since that's the Yahoo! Messages project channel. |
19:20
🔗
|
cascode |
oops, sorry about that. (Got the wrong IRC link from news.ycombinator.com, I guess.) |
19:32
🔗
|
alard |
soultcer: Are you working on the password-protection thing? Or can I? |
19:32
🔗
|
soultcer |
alard: I have not started on it yet. Feel free to implement it yourself |
19:33
🔗
|
alard |
I'll have a go then. I thought a command-line option would probably be enough to start with? |
19:33
🔗
|
soultcer |
config.json would be nicer because then it can be set with the userdata from ec2 |
19:34
🔗
|
soultcer |
But for me the difficult part is understanding how to add http auth, especially to the websocket stuff. Changing from command-line arg to config file is easy |
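As a rough illustration of the HTTP auth part, here is a standard-library sketch that validates an HTTP Basic `Authorization` header value. It is not seesaw's code, the function name is hypothetical, and it does not address the websocket side that soultcer calls the difficult part.

```python
import base64
import binascii
import hmac


def check_basic_auth(header_value, username: str, password: str) -> bool:
    """Return True if an 'Authorization: Basic ...' value matches the credentials."""
    if not header_value:
        return False
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    given_user, separator, given_password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(given_user.encode(), username.encode())
    password_ok = hmac.compare_digest(given_password.encode(), password.encode())
    return user_ok and password_ok


if __name__ == "__main__":
    header = "Basic " + base64.b64encode(b"warrior:secret").decode("ascii")
    print(check_basic_auth(header, "warrior", "secret"))
```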
19:34
🔗
|
alard |
Perhaps we can make it a combined option: --http-username --http-password *or* a config.json value. |
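One way the combined option could look, as a sketch only. The `--http-username` and `--http-password` flags are the ones alard names; the `http_username` and `http_password` config.json keys and the rule that the command line wins are assumptions.

```python
import argparse
import json


def load_credentials(argv, config_text=None):
    """Resolve (username, password) from CLI flags, falling back to config.json text."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--http-username")
    parser.add_argument("--http-password")
    args = parser.parse_args(argv)

    # Hypothetical config.json keys; the real key names were never settled in the log.
    config = json.loads(config_text) if config_text else {}
    username = args.http_username or config.get("http_username")
    password = args.http_password or config.get("http_password")
    return username, password


if __name__ == "__main__":
    config_json = '{"http_username": "from-config", "http_password": "from-config"}'
    print(load_credentials(["--http-username", "from-cli"], config_json))
```

Letting the command line override the file keeps local testing simple, while the config.json path stays usable for values supplied through ec2 userdata.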
21:03
🔗
|
dgsrgs962 |
love your work :) I wonder how long it'll be till they decide to cut Yahoo Groups |
21:32
🔗
|
lukegb |
soultcer: alard: I'm sort of tempted to add a single way of using the HTTP interface to control a whole set of warriors :P |
21:33
🔗
|
lukegb |
I think my stupidity overwhelmed alard's connection |
21:34
🔗
|
ersi |
lukegb: There's an API ish |
21:59
🔗
|
daxelrod |
Where can I find the repo for the code that makes up the Warrior web frontend? |
22:04
🔗
|
ersi |
daxelrod: https://github.com/ArchiveTeam/seesaw-kit |
22:05
🔗
|
ersi |
If I'm not mistaken, you want to look in seesaw/web.py |
22:05
🔗
|
daxelrod |
Ohh, it's in seesaw, ok |
22:05
🔗
|
daxelrod |
Thanks! |
22:05
🔗
|
ersi |
Yeah, the warrior scripts basically fix the environment and update the project code |