Website: http://www.datawrangling.com/
Original page: http://www.datawrangling.com/mpi-cluster-with-python-and-amazon-ec2-part-2-of-3
Excellent stuff! I've gotten started with EC2 and I'll be trying your images out soon. I doubt that I'll be trying to make ParallelKnoppix work on EC2, because your approach is the right one, I think. PK is designed to use when the hardware is not known ahead of time. With EC2, the hardware is known, so a tailor-made image is the way to go. Your scripts allow an on-demand cluster to be created in minutes, and that's all that PK offers, anyway. PK usually needs some remastering so that users can add their own packages. Re-bundling an EC2 image is completely analogous. I'm planning on doing just that, probably starting with your images, and doing some testing of latency on tasks that require different degrees of internode communication. Thanks for all this, it'll make the rest an easy job.
One question, do you know if something like an NFS shared home directory is possible. Using S3, possibly?
A little report on my trial.
1) ./ec2-start_cluster.py is not always successful in getting the requested number of nodes to come up. The instances sometimes have status "terminated" before anything is done with them.
2) When the 5 nodes all come up, I still get a problem with ./ec2-mpi-config.py requesting a root password:
michael@yosemite:~/ec2/AmazonEC2_MPI_scripts$ ./ec2-mpi-config.py
---- MPI Cluster Details ----
Numer of nodes = 5
Instance= i-e39c7a8a hostname= ec2-72-44-45-138.z-2.compute-1.amazonaws.com state= running
Instance= i-e29c7a8b hostname= ec2-72-44-45-185.z-2.compute-1.amazonaws.com state= running
Instance= i-e59c7a8c hostname= ec2-72-44-45-186.z-2.compute-1.amazonaws.com state= running
Instance= i-e49c7a8d hostname= ec2-72-44-45-122.z-2.compute-1.amazonaws.com state= running
Instance= i-e79c7a8e hostname= ec2-72-44-45-60.z-2.compute-1.amazonaws.com state= running
The master node is ec2-72-44-45-138.z-2.compute-1.amazonaws.com
Writing out mpd.hosts file
nslookup ec2-72-44-45-138.z-2.compute-1.amazonaws.com
(0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-138.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.138\n')
nslookup ec2-72-44-45-185.z-2.compute-1.amazonaws.com
(0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-185.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.185\n')
nslookup ec2-72-44-45-186.z-2.compute-1.amazonaws.com
(0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-186.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.186\n')
nslookup ec2-72-44-45-122.z-2.compute-1.amazonaws.com
(0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-122.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.122\n')
nslookup ec2-72-44-45-60.z-2.compute-1.amazonaws.com
(0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-60.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.60\n')
Warning: Permanently added 'ec2-72-44-45-138.z-2.compute-1.amazonaws.com,72.44.45.138' (RSA) to the list of known hosts.
id_rsa.pub 100% 1675 1.6KB/s 00:00
root@ec2-72-44-45-138.z-2.compute-1.amazonaws.com's password:
This is as far as I can get at the moment. Looks like a minor problem. Cheers, M.
Michael,
I haven't had the scripts prompt me for a password before. Are you running them from your local machine? The mpi-config script expects the keyname and keypair location to match what was used to start the instance. Take a look at your EC2config.py file and make sure the instances were all started with your own keypair (I used the gsg keypair I created on my laptop in the Amazon "getting started guide" tutorial):
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
MASTER_IMAGE_ID = "ami-3e836657"
IMAGE_ID = "ami-3e836657"
KEYNAME = "gsg-keypair"
KEY_LOCATION = "~/id_rsa-gsg-keypair"
DEFAULT_CLUSTER_SIZE = 5
I'm working on an updated version of the scripts and EC2 image which should make things a bit cleaner. Sorry the code is ugly right now in terms of error handling...I just wanted to toss something together to get people started :)
Yep, I run the mpi-config script right after creating the instances, doing just what you suggest. The fact that the instances start up at all seems to me to mean that the keypair information is ok. Do you know if anyone but you has been able to launch a cluster? Very cool stuff. I'm going to be looking into making a Debian AMI that works the same way.
Mike Cariaso modified my scripts to fix some path issues and got it working on a windows laptop, he might have also fixed some other errors I didn't notice. I haven't had a chance to try them yet, but you can download the modified scripts here:
http://mpiblast.pbwiki.com/AmazonEC2
===== DO NOT USE THESE SCRIPTS! =====
This section of ec2-mpi-config.py is a bit problematic:
os.system('cp %s ~/id_rsa.pub' % KEY_LOCATION )
os.system('cp ~/id_rsa.pub ~/.ssh/id_rsa')
This will clobber any existing rsa key on the initiating machine's account, and will break normal auth on the next login if you have a different default rsa key!
The script should instead copy the private key directly from KEY_LOCATION to the nodes.
===== DO NOT USE THESE SCRIPTS! =====
Otherwise, way cool. Thanks for putting this tutorial together. We're trying EC2 clusters out as a way to get quicker feedback from regression tests after changes to our software. Unfortunately, with the one hour granularity I don't think it will be price competitive. We want 20-100 nodes for about 5 minutes at a time.
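To put rough numbers on the hourly-granularity concern, here is a minimal sketch. It assumes every started instance-hour is billed in full, as described above. The hourly rate is a parameter you supply, since no price for these instances is given in this comment.

```python
import math


def billed_cost(nodes, minutes, rate_per_hour):
    """Cost of one run when each started instance-hour is billed in full."""
    billed_hours = math.ceil(minutes / 60)
    return nodes * billed_hours * rate_per_hour


def utilization(minutes):
    """Fraction of the billed time that the run actually used."""
    billed_minutes = math.ceil(minutes / 60) * 60
    return minutes / billed_minutes


if __name__ == "__main__":
    # Hypothetical rate of 0.10 per instance-hour, for illustration only.
    print(billed_cost(nodes=100, minutes=5, rate_per_hour=0.10))
    print(utilization(5))
```

A 5-minute run uses one twelfth of the hour it pays for, which is why short bursts on many nodes look expensive under this billing model.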
Ralph,
Good catch. Thanks for pointing that out. I just lifted those passwordless ssh lines straight from an MPI tutorial.
This might solve the clobbering as well (from http://www.maclife.com/forums/topic/61520):
cat id_rsa.pub >> .ssh/authorized_keys
"The above command will create the "authorized_keys" file in the ".ssh" directory if that file doesn't already exist, and it will append the new id_rsa.pub file to it if it does already exist."
I'll add that change to the scripts. Good luck with the regression cluster, I heard Oracle developers do something like that using Condor on otherwise idle desktops (see http://www.cs.wisc.edu/condor/doc/nmi-lisa2006-slides.pdf).
-Pete
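The append-instead-of-overwrite idea can also be done from Python. This is a minimal sketch, not the code that went into the scripts: the function name `append_authorized_key` is made up for illustration, and it additionally skips the write when the key is already listed, so rerunning a config script does not pile up duplicates.

```python
import os


def append_authorized_key(pub_key_line, authorized_keys_path):
    """Append one public key line unless it is already present.

    Returns True if the file was changed, False if the key was already listed.
    """
    key = pub_key_line.strip()
    ssh_dir = os.path.dirname(authorized_keys_path)
    if ssh_dir:
        os.makedirs(ssh_dir, mode=0o700, exist_ok=True)

    content = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as fh:
            content = fh.read()
    if key in (line.strip() for line in content.splitlines()):
        return False

    with open(authorized_keys_path, "a") as fh:
        if content and not content.endswith("\n"):
            fh.write("\n")  # keep the new key on its own line
        fh.write(key + "\n")
    os.chmod(authorized_keys_path, 0o600)  # sshd ignores overly open files
    return True
```

Existing keys in the file are never touched, which avoids the clobbering problem Ralph described.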
Yeah, that would work better. Some more detailed comments:
<ul>
<li>
Your image has /home/lamuser/.mpd.conf owned by root. I had to chown it to lamuser before I could start mpd.
</li><li>
Your script passes the public dns names for the nodes into mpd.hosts. For that to work, a hole has to be opened in the firewall for the ports the mpi daemon is using. A simpler solution is to just pass the internal dns names. Then all the traffic happens behind the firewall, which probably also improves latency. (Although my ringtest was noticeably slower than yours, averaging 2.2e-3 seconds/loop, so who knows?)
</li><li>
I was surprised that when I originally ran ec2-add-keypair in the EC2 tutorial that it uploaded the public key (ok) and printed out the private key (ok I guess) but didn't print out the public key locally (weird). Your scripts seem to assume the public key is available as id_rsa.pub on the client machine. Shouldn't this first be copied either from /root/.ssh/authorized_keys on the master node (as installed by amazon) or retrieved through the query interface?
</li></ul>
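The internal-DNS suggestion in the list above could look roughly like this. It is a sketch under assumptions: each instance is represented as a dict with hypothetical keys `private_dns` and `state`, which is not how the actual scripts store their data.

```python
def mpd_hosts_lines(instances):
    """Return the internal DNS name of every running instance, in input order."""
    return [inst["private_dns"] for inst in instances if inst["state"] == "running"]


def write_mpd_hosts(instances, path="mpd.hosts"):
    """Write one internal hostname per line, the layout mpdboot reads with -f."""
    with open(path, "w") as fh:
        fh.write("\n".join(mpd_hosts_lines(instances)) + "\n")
```

Because only internal names end up in the file, the MPI daemons talk to each other behind the firewall and no extra ports need to be opened to the outside.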
Is the mutual ssh access required for more than just launching the MPI daemon? If all subsequent traffic goes through the mpi daemons, starting mpd from the client machine, or automatically from the init scripts after pulling mpd.hosts from S3 would save the whole trouble, including uploading the private key at all.
Ralph,
More good points. I've been tied up with some other projects, but it sounds like enough feedback is in to make a revised version of the image and scripts. I expect the latency to vary a bit depending on the random EC2 network topology when a cluster is launched...(instances on the same box vs. over ethernet) that might explain the ringtest. The mutual ssh access was set up since we do a lot of file/data shuffling between nodes outside of MPI.
Thanks again, looking forward to hearing how the regression test system works out.
-Pete
Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments.
Specific changes made:
<ul>
<li>fixed lamuser home directory permissions bug</li>
<li>fixed section of ec2-mpi-config.py which clobbered existing rsa keys on the client machine</li>
<li>updated calls of AWS python EC2 library to use API version 2007-01-19 (http://developer.amazonwebservices.com/connect/...)</li>
<li>fixed mpdboot issue by using amazon internal DNS names in hosts files</li>
<li>scripts should now work on windows/cygwin client environments</li>
</ul>
After I run some benchmarks, I'm hoping to find some time to add LAM and OpenMPI to the EC2 image along with NFS configuration, C3 cluster tools, Ganglia, and a benchmarking package.
What about that Part 3? :)
the first two parts really set the stage ... Part 3?
:)
Does the 5 month hiatus in this project mean that it was a bad idea and you guys have learnt enough to waste no more time on it?
Given the virtualization uncertainty, finding the right communication/computation balance for typical MPI programs appears to be very unrewarding. Secondly, MPI development and debug and then QA and scale out are not addressed, which doesn't bode well. It appears most productive to have a local small cluster for development and debug, and then do QA and scale out on EC2, but some benchmarking numbers would really help.
If EC2 is only robust for embarrassingly parallel problems, then MapReduce style programs are more attractive. There the size of the data set and how well it integrates in a distributed file system appear to be the problems to focus on. Or BOINC like approaches if there is no integrated DFS. Anyone have operational data on these approaches?
Theo,
Sorry for the delay in posting this and responding. I've been working on a startup for the past 7 months and was in serious crunch mode. Don't read too much into the large gap in posts, it is just me working on this as a side-project. I finished moving the blog to another host and finally have some time to get back to the EC2 work. This experience has taught me to never name a series of blog posts "part 1 of N" :)
You make some excellent points. One thing that has changed since I wrote the first post is that EC2 now offers larger 64bit machine images with better I/O (you can provision an entire physical server and not be limited by sharing network resources in the virtual instance). I'd like to see if this improves the network performance. I'm giving a talk on this in March, so I'm on the hook to have some benchmarks by then.
I also agree on the mapreduce side. For embarrassingly parallel problems, hadoop on ec2 is potentially much more attractive...more robust, easier for most people to program. Ideally, I would like to do some comparisons between the two approaches and run the numbers.
The performance of an EC2 MPI cluster is definitely going to be worse than your own custom hardware, but it still might fit certain niche situations. In my case, I needed to run some MPI code for a large problem and didn't have access to a large enough cluster. The performance on EC2 was nowhere near what you get on a high-end cluster, but it got the job done for a reasonable price.
This discussion on the beowulf list goes into more detail on the pros/cons:
http://www.beowulf.org/pipermail/beowulf/2008-January/020490.html
-Pete
I can't get ec2-mpi-config.py to work. It says "list index out of range" for mpi-externalnames[0] on line 108.
The start-cluster and check-instances scripts run OK, so I think that Python, EC2, and elementtree are OK.
Any ideas why? Has AWS changed the format of the response you're parsing? (Yes, I have had a look at the Python code, but since I haven't used Python before I can't see anything obvious.)
BTW you have a typo in mpi config Numer of nodes as opposed to Number of nodes , it even shows in your example above.
Otherwise I like what you've done, I'd just like it to work for me.
Thanks,
Pete
pete found the error... the image Ids he entered into the config module inadvertently contained a capital letter. This doesn’t cause any problems for starting images since string case is ignored by Amazon. The corresponding image id response string from AWS is always lowercase, so the python script comparison on image ID string fails.
In the next version of the scripts, I will handle upper/lowercase differences in the ami strings. For now, just make sure to use all lower case or call the python .lower() method,
<pre lang="python">
>>> test = 'ami-fE9a7f97'
>>> test.lower()
'ami-fe9a7f97'
>>>
</pre>
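Until the scripts handle this themselves, the comparison can be made case-insensitive in one place. A minimal sketch follows; the helper names and the `image_id` key are hypothetical and do not come from the actual scripts.

```python
def same_image(requested_id, reported_id):
    """Compare two AMI ids while ignoring case and surrounding whitespace."""
    return requested_id.strip().lower() == reported_id.strip().lower()


def instances_for_image(instances, image_id):
    """Keep only the instance records whose image id matches image_id."""
    return [inst for inst in instances if same_image(inst["image_id"], image_id)]
```

With this in place, an id typed as 'ami-fE9a7f97' in the config still matches the lowercase id that AWS reports back.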
Found another typo too, ok I'm nit picking. In the stop-cluster script the message says Stoping as opposed to stopping. A year ago when you first posted this stuff you mentioned that the reason why the non-root user was called lamuser was that the scripts were used for LAM in some previous incarnation. Since I'm actually trying to use LAM, if you have any LAM stuff around that might help me to iron out one or two problems I still have.
Anyway, thanks again,
Pete
No problem, thanks for finding the typos. These were meant to be some quick hacks, but took on a life of their own after a while.
I found this worked for configuring LAM, I'll send you more details in an email...
The contents of bash_profile should be as follows:
<pre lang="code">
-bash-3.1# more .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
LAMRSH="ssh -x"
export LAMRSH
LD_LIBRARY_PATH="/usr/local/lam-7.1.2/lib/"
export LD_LIBRARY_PATH
MPICH_PORT_RANGE="2000:8000"
export MPICH_PORT_RANGE
PATH=$PATH:$HOME/bin
PATH=/usr/local/lam-7.1.2/bin:$PATH
MANPATH=/usr/local/lam-7.1.2/man:$MANPATH
export PATH
export MANPATH
</pre>
Launch the cluster on EC2 and try booting LAM manually:
<pre lang="code">
[lamuser@domU-12-31-33-00-04-4B ~]$ lamboot /etc/mpd.hosts
[lamuser@domU-12-31-33-00-04-4B ~]$ lamnodes
n0 domU-12-31-33-00-04-4B.usma1.compute.amazonaws.com:1:origin,this_node
n1 domU-12-31-33-00-03-35.usma1.compute.amazonaws.com:1:
n2 domU-12-31-33-00-03-3C.usma1.compute.amazonaws.com:1:
n3 domU-12-31-34-00-00-55.usma2.compute.amazonaws.com:1:
[lamuser@domU-12-31-33-00-04-4B ~]$ tping N -c3
1 byte from 3 remote nodes and 1 local node: 0.039 secs
1 byte from 3 remote nodes and 1 local node: 0.004 secs
1 byte from 3 remote nodes and 1 local node: 0.002 secs
</pre>
Why does it ask me for a password when i try to run the ec2-mpi-config.py file.?
it says root@xxx password:
And I get a lot of text on the terminal when I try running the file.
raghav,
I assume you were able to start the instances with ec2-start-cluster.py? The text on the terminal is normal, but it shouldn't ask you for a password (I should probably add a verbose option instead of streaming out text by default). There was a path issue on windows with an earlier version of the scripts, so that may be the problem.
If you send me the script version number from the README and/or terminal output, I can try to track down what is going on...
peter.skomoroch@gmail.com
-Pete
raghav,
Another suggestion is to make sure the instances are running with ./ec2-check-instances.py and then retry the script, sometimes it takes a while for sshd to start up on EC2.
-Pete
Hey guys,
Actually I made a change in the ec2-mpi-cluster.py file. I have no clue about python and I dono why it worked but it worked.
I modified:
template = 'ssh -o "StrictHostKeyChecking no" %(user)s@%(host)s "%(cmd)s"'
to
template = 'ssh -i "/home/id_rsa-gsg-keypair" %(user)s@%(host)s "%(cmd)s"'
and
template = '%(cmd)s %(switches)s -o "StrictHostKeyChecking no" %(src)s %(user)s@%(host)s:%(dest)s'
to
template = '%(cmd)s %(switches)s -i "/home/id_rsa-gsg-keypair" %(src)s %(user)s@%(host)s:%(dest)s'
And it started working perfectly fine. I was able to log in to the master node and the pi problem executed perfectly fine.
Thanks a lot guys
Cheers,
Raghav
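A likely reason this works: without `-i`, ssh only tries its default identity files, so a keypair stored elsewhere is never offered and the server falls back to a password prompt. The sketch below passes the key explicitly while keeping the host-key option from the original templates. The function names are hypothetical, and it builds argument lists rather than the string templates the scripts use.

```python
import os


def ssh_command(key_path, user, host, cmd):
    """Argument list for running cmd on host with an explicit identity file."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),
        "-o", "StrictHostKeyChecking=no",
        "%s@%s" % (user, host),
        cmd,
    ]


def scp_command(key_path, src, user, host, dest):
    """Argument list for copying src to host:dest with an explicit identity file."""
    return [
        "scp",
        "-i", os.path.expanduser(key_path),
        "-o", "StrictHostKeyChecking=no",
        src,
        "%s@%s:%s" % (user, host, dest),
    ]
```

Either list can be handed to `subprocess.call(...)`. Passing a list avoids the nested shell quoting that the string templates need.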
Thanks pete. For your prompt reply!!
Thanks Pete. I wish I had made the PyCon session, but these posts have been very helpful. The cluster went up pretty quickly and I have already used it to crunch a few minor data runs.
In setting everything up I also ran into a similar problem as Raghav and ended up solving it in a similar manner by forcing the -i credentials switch. I imagine it has something to do with the way I configured and placed my certs.
i am trying to compile a simple c mpi file "hellompi.c" using the command:
why does it give me the following error?
/usr/bin/ld: cannot open output file /usr/hellompi: Permission denied
collect2: ld returned 1 exit status
how do I get root privileges?
Raghav,
You can ssh in as root instead of lamuser, or compile the output file into your home directory.
Check out the new AMI and management code:
http://www.datawrangling.com/pycon-2008-elasticwulf-slides.html
The new AMI includes a preconfigured NFS mounted directory /home/beowulf. If you compiled the file there, hellompi would be available on all nodes.
Note that the new images default to the 'large' instance type, which costs $0.40/hour for each node.
-Pete
Peter,
Very useful tool! I've gotten a cluster up and running using the small instance type but am having difficulty launching the _64 AMIs.
$ ./ec2-start-cluster.py
m1.large
image ami-eb13f682
master image ami-e813f681
----- starting master -----
Traceback (most recent call last):
File "./ec2-start-cluster.py", line 39, in ?
master_response = conn.run_instances(imageId=MASTER_IMAGE_ID, minCount=1, maxCount=1, keyName= KEYNAME, instanceType=INSTANCE_TYPE )
TypeError: run_instances() got an unexpected keyword argument 'instanceType'
If I try to start the cluster without passing an INSTANCE_TYPE arg I get the following:
$ ./ec2-start-cluster.py
m1.large
image ami-eb13f682
master image ami-e813f681
----- starting master -----
InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-e813f681 (x86_64)
----- starting workers -----
InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-eb13f682 (x86_64)
Any ideas? Thanks!
Patrick,
Did you start with a clean install of the 64 bit scripts? I made some changes to EC2.py in the new scripts to handle the new instance types...
Peter:
I am diving into Hadoop with Map/Reduce as we speak. As you know Google implemented its environment in C++, so I was a bit disappointed that Hadoop had chosen Java VM to do its bidding. Java makes interfacing with hardcore numerical operations much harder. The particular problems I am looking at are large scale Lanczos solvers to find eigen values/vectors of large systems of equations. These systems are of interest in advertising, quantitative finance, and sensor networks. Problem is that they all are environments in which latency is of the essence. So you have a capacity component in terms of the size of the system and a latency issue in terms of the data rate coming in and the opportunity cost for somebody to get to the answer faster.
I would be interested in working on this particular benchmark problem: pick a big eigen value/vector problem and solve it on a cluster, EC2, and via Hadoop/Map-reduce. Clearly this is going to be a lot of work, so it should be worth publishing. I am sure many folks would be interested in this experiment, so let me know if this is something that you could invest time in.
Theo
Thanks, Peter. The original EC2.py was the problem. I now have the large AMIs up and running. Thanks again for the article and help!
Patrick
I found the secret to avoiding a lot of MPI errors on EC2, but haven't found time to do an additional post...
The secret seems to be that just because Amazon says that an instance is "running", doesn't mean that the ssh daemons are available. This caused all kinds of intermittent problems setting up the hosts and my old scripts would fail silently.
In my current codebase, I do some checks like the following:
<pre lang="python">
import commands  # Python 2 only; subprocess.getoutput is the Python 3 equivalent
import time

# Excerpt from inside a build function; BOOTING_INSTANCE and SSH_OPTS are set earlier.
ec2_launch_failed = False
print "Instance is %s" % BOOTING_INSTANCE
# wait for instance description to return "running" and grab HOSTNAME variable
print "Polling server status (ec2-describe-instances %s)" % BOOTING_INSTANCE
while 1:
    print "waiting for instance to boot..."
    HOSTNAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $4}'" % BOOTING_INSTANCE)
    if len(HOSTNAME) > 1:
        print "-------Instance booted, The server is available at %s" % HOSTNAME
        DOM_NAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $5}'" % BOOTING_INSTANCE).split('.')[0]
        break
    time.sleep(1)
# sometimes it takes a while for the ssh service to start, even when the ec2 api describes an instance as running.
# A machine in the "running" state may not have finished booting. Try executing a no-op command until a valid response is found
print "verifying ssh daemon has started..."
counter = 0
while 1:
    print "Waiting for ssh daemon to start..."
    counter += 1
    REPLY = commands.getoutput('''ssh %s "root@%s" 'echo "hello"' ''' % (SSH_OPTS, HOSTNAME))
    if REPLY == 'hello':
        print "-------ssh has started, proceeding with AMI build"
        break
    if counter > 24:
        print "Instance not responding to SSH hails, aborting..."
        ## sshd should not take more than 2 minutes to launch
        terminate_status = commands.getoutput('ec2-terminate-instances %s' % BOOTING_INSTANCE)
        ec2_launch_failed = True
        print "Base Instance terminated"
        break
    time.sleep(5)
if ec2_launch_failed:
    print "Aborting build"
    return
</pre>
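The same "running does not mean reachable" check can be factored into a small retry helper. This is a Python 3 sketch, not the code from the current codebase: `wait_until` and `ssh_is_up` are hypothetical names, and the defaults of 24 attempts 5 seconds apart mirror the two-minute limit in the excerpt above.

```python
import subprocess
import time


def wait_until(probe, attempts=24, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final failed attempt
    return False


def ssh_is_up(host, key_path, user="root"):
    """Run a no-op command over ssh and report whether it came back intact."""
    result = subprocess.run(
        ["ssh", "-i", key_path,
         "-o", "StrictHostKeyChecking=no",
         "-o", "ConnectTimeout=5",
         "%s@%s" % (user, host), "echo hello"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "hello"
```

A caller would write something like `wait_until(lambda: ssh_is_up(hostname, key_path))` and terminate the instance if it returns False. Keeping the probe separate from the retry loop lets the loop be tested without a live instance.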
@Theo,
I'm attending MMDS this week at Stanford (http://www.stanford.edu/group/mmds/), and had a chance to ask James Demmel a few questions. He gave a talk titled "Avoiding communication in linear algebra algorithms", which was very relevant. His advice for matrix multiplication in a high latency environment like EC2 was to try dialing up the block size as much as possible in the standard MPI solvers and see how performance was affected.
-Pete
Hi Peter,
Have you tried to connect EC2 instances with your local desktops? I am trying to do that with mpich2 1.0.7 but I am not successful at all. mpdboot complains about invalid port info (no_port) - actually no port when I try to do mpdboot -n 2. Even when I tried to mpd& on EC2 machine and then mpdtrace -l and then unblock the port and then mpd -h ec2-blabla -p ec2-mpdtrace-l-port still I have no luck. Have you faced similar problems?
Thanks
- magg
Magg,
I wouldn't recommend it, the latency would be huge and I'm not sure how MPI would handle that. You would also need to open the mpi ports to the outside world using the EC2 security group authorize commands.
An alternative is to open an X11 session and connect to the head node or maybe VNC in to the instance. The 64 bit elasticwulf images are set up for X11 sessions and adding a desktop package would allow you to VNC in if you prefer that route.
-Pete
mpdboot_domU-blah(handle_mpd_output 414): from mpd on domU-blah-,
invalid port info:
no_port
any word on this?
If you are interested in running MPI on EC2, I have a new project on Github I'll be announcing soon:
http://github.com/datawrangling/ec2cluster/tree...
Great project and thanks very much for sharing! I do have some trouble getting it all to work though. Everything works fine until it tries to run the create_hosts.py:
/////// OUTPUT ///////////////
Creating hosts file on master node and copying hosts file to compute nodes...
pscp -scp -i D:\grid\keys\keypair.ppk -q create_hosts.py root@ec2-67-202-19-253.
compute-1.amazonaws.com:/etc/
plink -ssh -i D:\grid\keys\keypair.ppk root@ec2-67-202-19-253.compute-1.amazonaw
s.com "python /etc/create_hosts.py"
exporting 10.252.31.48:/home/beowulf
exporting 10.252.31.48:/mnt/data
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/root/.ssh/id_rsa' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /root/.ssh/id_rsa
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
lost connection
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/root/.ssh/id_rsa' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /root/.ssh/id_rsa
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
lost connection
etcetera
//////////////////////////////////
As you can see I made some small modifications in order to use PuTTy as my SSH client, but that does not seem to be the problem... Does anyone else have this problem, and does anyone know how to fix it?
Got it working using OpenSSH, guess PuTTy was the problem after all.
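For anyone who sees the same "UNPROTECTED PRIVATE KEY FILE" warning: OpenSSH ignores a private key that other users can read, so the key has to be restricted to its owner wherever it ends up. This is a minimal sketch of that step, assuming a POSIX system. The function name is hypothetical and this is not part of create_hosts.py.

```python
import os
import stat


def restrict_private_key(path):
    """Make a private key readable and writable by its owner only (mode 0600).

    Returns the resulting permission bits so the caller can verify them.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

In the log above the affected file is /root/.ssh/id_rsa on the node, so the fix has to run there after the key is copied, not only on the client.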
Hi,
Thanks for your writeup! It's very helpful. I'm running into an error with mpdtrace and was hoping for some of your insight into it. I am running mpd as root, with one node for simplicity.
I can successfully start up mpd on the instance and "mpd &":
root@...:/etc# mpdboot -n 1 -f mpd.hosts
root@...:/etc# mpd &
[1] 2280
but "mpdtrace -l" gives me an error:
root@ip-10-251-143-0:/etc# mpdtrace -l
mpdtrace: unexpected msg from mpd=:{'error_msg': 'invalid secretword to root mpd'}:
I have tried all pairwise combinations of having MPD_SECRETWORD= or secretword= in ~/.mpd.conf and /etc/mpd.conf, all of which were set to read/write for root only.
I also can't run "mpdallexit":
root@...:~# mpdallexit
mpdallexit: mpd_uncaught_except_tb handling:
: 'cmd'
/usr/local/bin/mpich2-install/bin/mpdallexit 53 mpdallexit
elif msg['cmd'] != 'mpdallexit_ack':
/usr/local/bin/mpich2-install/bin/mpdallexit 59
mpdallexit()
I can also run mpdcheck as a server and have it listen for mpdcheck as a client from the same instance (in a different window).
Suggestions/help? I'd greatly appreciate any advice you have on this problem. Thanks --
- Joanne
Joanne,
Try logging in and running your commands as "lamuser" instead of root. The default configuration assumes lamuser is running all commands.
$ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com
See part 1 of the post for details on changing the configuration to run MPI as root.
-Pete
@Pete - Feb08
You are right - the output from ec2-describe-instances has changed. Do the following..
Change
machine_state.append(chunk[-1])
to
machine_state.append(chunk[5])
in "ec2-mpi-config.py"
Or, if the output changes again - just do an "ec2-describe-instances" and match up the required fields to the index on the chunk[] array
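Matching fields to indexes is easier to maintain if it lives in one function. The sketch below uses the positions mentioned in this thread for a running instance: public DNS in the fourth column, internal DNS in the fifth, state at chunk[5]. Those positions are an assumption about the tool's output and should be checked against your own "ec2-describe-instances" output. The function name is hypothetical.

```python
def parse_instance_line(line):
    """Parse one INSTANCE line into a dict, or return None if it does not fit.

    Field positions are assumed from this thread and may differ between
    versions of the command-line tools.
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] != "INSTANCE":
        return None
    return {
        "instance_id": fields[1],
        "image_id": fields[2],
        "public_dns": fields[3],
        "private_dns": fields[4],
        "state": fields[5],
    }
```

If the output format changes again, only the index numbers in this one function need updating.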
Hello,
I get the same problem that Michael Creel was having. I am able to start the instances and get them "running" successfully, by pointing them to my keypair with the KEYNAME variable, but I believe my KEY_LOCATION variable in my EC2config.py file must be causing the prompt for a password.
This is all per the default block of code in EC2config.py:
# change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)
KEYNAME = "my_keypair"
KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"
I believe this requires me to go back through the "getting started guide", but I just wanted to update my progress in case others are seeing the same thing.
Many thanks for sharing your progress Peter!
Ben Racine
Hello Pete -
Thanks for putting together your ElasticWulf scripts and AMIs. They have saved me a huge amount of time and effort compared with building my own from scratch.
I am interested in parallelizing some machine learning algorithms written in R. My interest in ElasticWulf comes partly from the fact that R is already bundled with its AMIs. I discovered, however, that Rmpi is not one of the installed packages. What were your intentions/plans with R in the ElasticWulf environment? Did you plan for parallel communication using a mechanism other than MPI?
It wasn't hard to install Rmpi on top of the ElasticWulf AMI, but despite a couple of days' struggle, I haven't found a combination of Rmpi version and paths to the AMI's existing MPI libraries that fully works. The best I've been able to do is spawn an R cluster where all the nodes are running on the master node.
Could you tell me what version of R and the various MPI implementations (OpenMPI, MPICH, LAM) went into your 64-bit AMI? That might help me sort things out. A couple of observations, for what they're worth:
1) Once I have an ElasticWulf cluster up and have run mpdboot, I find that mpiexec works, but orterun (the equivalent in OpenMPI) does not.
2) There has been at least one report of problems between the latest version of Rmpi and OpenMPI:
https://stat.ethz.ch/pipermail/r-sig-hpc/2009-February/000105.html
Much thanks.
Jeff Howbert
Jeff,
I'm actually working on a new version of Elasticwulf right now. Shoot me an email at pete@datawrangling.com and I'll try to include what you need for Rmpi. If you have some sample Rmpi code you want to test and that you don't mind releasing, we can build that into the AMI to ensure everything you need is installed.
Here are the MPI installs that were included on that Fedora 64 bit image:
<pre>
# mpich2
cd /usr/local/src/
wget http://www.mcs.anl.gov/research/projects/mpich2...
tar -xzvf mpich2-1.0.6p1.tar.gz
cd mpich2-1.0.6p1
./configure --enable-sharedlibs=gcc --prefix=/usr/local/mpich2
make
make install
# openmpi
cd /usr/local/src/
wget http://www.open-mpi.org/software/ompi/v1.2/down...
#wget http://www.open-mpi.de/software/ompi/v1.2/downl...
tar -zxf openmpi-1.2.5.tar.gz
cd openmpi-1.2.5
./configure --prefix=/usr/local/openmpi
make all
make install
#lam
cd /usr/local/src/
wget http://www.lam-mpi.org/download/files/lam-7.1.2...
tar -xzvf lam-7.1.2.tar.gz
cd lam-7.1.2
./configure --enable-shared --prefix=/usr/local/lam
make
make install
# mpich1
cd /usr/local/src/
wget http://www-unix.mcs.anl.gov/mpi/mpich1/download...
tar -zxf mpich.tar.gz
cd mpich-1.2.7p1/
./configure --enable-sharedlib --prefix=/usr/local/mpich
make
make install
</pre>
Andrew
http://github.com/datawrangling/ec2cluster/tree...
Thanks for the awesome ec2cluster project! I forked it on github so that I could add the ability to install R packages from CRAN across all the nodes in the cluster. Basically I added the following to ubuntu_installs.sh:
--snip--
# Custom R packages
cat <<EOF >> /home/ec2cluster/install_custom_packages.R
install.packages("DEoptim",repos="http://cran.stat.ucla.edu")
EOF
R CMD BATCH /home/ec2cluster/install_custom_packages.R
--snip--
A bit crude I know, but something people wanting to do things with R will probably find useful.
To actually run the R code, I was successful with the following approach:
1) put your R code in a file and save it as foo.R
2) add the following line to a shellscript: mpirun -n 1 -hostfile /mnt/ec2cluster/openmpi_hostfile R CMD BATCH foo.R
3) call the shellscript and grab the produced .Rout file to your S3 bucket!
Glad ec2cluster helped, are you guys big R users at StockTwits?
-Pete
Would this be compatible with your cluster app? I notice that it's more schedule-focused and the nodes need to talk to the app; is the app acting as the master rather than one of the ec2 nodes? Ideally my architecture would be something like master Rmpi node running on ec2 talking to arbitrary slave #nodes, accessed through maybe something like the Biocep remote R client (http://biocep-distrib.r-forge.r-project.org/); clustering done in-session.
The web interface can be run anywhere, but needs to be https accessible to the EC2 nodes. I usually just run it as a small ec2 instance as shown in the docs. You can start a job from the API or the web console with shutdown_after_complete = false, and the cluster will remain live for interactive work, just ssh into the master node like you would with Elasticwulf. The app is not acting as the MPI master node, but the cluster nodes do talk to the app to handle configuration etc.
I got an error with the Fortran 77 library in AMI ami-e813f681 (Fedora Core 6, x86_64). It seems to involve two libraries in the AMI, libf2c and libg2c:
(....
checking for f_exit in -lf2c... no
checking for f_exit in -lg2c... no
checking for dummy main to link with Fortran 77 libraries... unknown
configure: error: linking to Fortran libraries from C fails
)
Could anybody help me solve the problem? Thanks so much!
best regards,
Harry