Data Wrangling Blog: MPI Cluster with Python and Amazon EC2 (part 2 of 3)

  • Michael Creel · 2 years ago

    Excellent stuff! I've gotten started with EC2 and I'll be trying your images out soon. I doubt that I'll be trying to make ParallelKnoppix work on EC2, because your approach is the right one, I think. PK is designed for use when the hardware is not known ahead of time. With EC2, the hardware is known, so a tailor-made image is the way to go. Your scripts allow an on-demand cluster to be created in minutes, and that's all that PK offers, anyway. PK usually needs some remastering so that users can add their own packages. Re-bundling an EC2 image is completely analogous. I'm planning on doing just that, probably starting with your images, and doing some testing of latency on tasks that require different degrees of internode communication. Thanks for all this, it'll make the rest an easy job.

  • Michael Creel · 2 years ago

    One question: do you know if something like an NFS shared home directory is possible? Using S3, possibly?

  • Michael Creel · 2 years ago

    A little report on my trial.
    1) ./ec2-start_cluster.py is not always successful in getting the requested number of nodes to come up. The instances sometimes have status "terminated" before anything is done with them.


    2) When the 5 nodes all come up, I still get a problem with ./ec2-mpi-config.py requesting a root password:


    michael@yosemite:~/ec2/AmazonEC2_MPI_scripts$ ./ec2-mpi-config.py


    ---- MPI Cluster Details ----
    Numer of nodes = 5
    Instance= i-e39c7a8a hostname= ec2-72-44-45-138.z-2.compute-1.amazonaws.com state= running
    Instance= i-e29c7a8b hostname= ec2-72-44-45-185.z-2.compute-1.amazonaws.com state= running
    Instance= i-e59c7a8c hostname= ec2-72-44-45-186.z-2.compute-1.amazonaws.com state= running
    Instance= i-e49c7a8d hostname= ec2-72-44-45-122.z-2.compute-1.amazonaws.com state= running
    Instance= i-e79c7a8e hostname= ec2-72-44-45-60.z-2.compute-1.amazonaws.com state= running


    The master node is ec2-72-44-45-138.z-2.compute-1.amazonaws.com


    Writing out mpd.hosts file
    nslookup ec2-72-44-45-138.z-2.compute-1.amazonaws.com
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-138.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.138\n')
    nslookup ec2-72-44-45-185.z-2.compute-1.amazonaws.com
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-185.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.185\n')
    nslookup ec2-72-44-45-186.z-2.compute-1.amazonaws.com
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-186.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.186\n')
    nslookup ec2-72-44-45-122.z-2.compute-1.amazonaws.com
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-122.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.122\n')
    nslookup ec2-72-44-45-60.z-2.compute-1.amazonaws.com
    (0, 'Server:\t\t158.109.0.1\nAddress:\t158.109.0.1#53\n\nNon-authoritative answer:\nName:\tec2-72-44-45-60.z-2.compute-1.amazonaws.com\nAddress: 72.44.45.60\n')
    Warning: Permanently added 'ec2-72-44-45-138.z-2.compute-1.amazonaws.com,72.44.45.138' (RSA) to the list of known hosts.
    id_rsa.pub 100% 1675 1.6KB/s 00:00
    root@ec2-72-44-45-138.z-2.compute-1.amazonaws.com's password:


    This is as far as I can get at the moment. Looks like a minor problem. Cheers, M.

  • Peter Skomoroch · 2 years ago

    Michael,


    I haven't had the scripts prompt me for a password before. Are you running them from your local machine? The mpi-config script expects the keyname and keypair location to match what was used to start the instance. Take a look at your EC2config.py file and make sure the instances were all started with your own keypair (I used the gsg keypair I created on my laptop in the Amazon "getting started guide" tutorial):





    AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID_HERE'
    AWS_SECRET_ACCESS_KEY = 'YOUR_KEY_HERE'
    MASTER_IMAGE_ID = "ami-3e836657"
    IMAGE_ID = "ami-3e836657"
    KEYNAME = "gsg-keypair"
    KEY_LOCATION = "~/id_rsa-gsg-keypair"
    DEFAULT_CLUSTER_SIZE = 5





    I'm working on an updated version of the scripts and EC2 image which should make things a bit cleaner. Sorry the code is ugly right now in terms of error handling...I just wanted to toss something together to get people started :)

  • Michael Creel · 2 years ago

    Yep, I run the mpi-config script right after creating the instances, doing just what you suggest. The fact that the instances start up at all seems to me to mean that the keypair information is ok. Do you know if anyone but you has been able to launch a cluster? Very cool stuff. I'm going to be looking into making a Debian AMI that works the same way.

  • Peter Skomoroch · 2 years ago

    Mike Cariaso modified my scripts to fix some path issues and got it working on a Windows laptop. He might also have fixed some other errors I didn't notice. I haven't had a chance to try them yet, but you can download the modified scripts here:


    http://mpiblast.pbwiki.com/AmazonEC2

  • Ralph Giles · 2 years ago

    ===== DO NOT USE THESE SCRIPTS! =====


    This section of ec2-mpi-config.py is a bit problematic:


    os.system('cp %s ~/id_rsa.pub' % KEY_LOCATION )
    os.system('cp ~/id_rsa.pub ~/.ssh/id_rsa')


    This will clobber any existing rsa key on the initiating machine's account, and will break normal auth on the next login if you have a different default rsa key!


    The script should instead copy the private key directly from KEY_LOCATION to the nodes.


    ===== DO NOT USE THESE SCRIPTS! =====


    Otherwise, way cool. Thanks for putting this tutorial together. We're trying EC2 clusters out as a way to get quicker feedback from regression tests after changes to our software. Unfortunately, with the one hour granularity I don't think it will be price competitive. We want 20-100 nodes for about 5 minutes at a time.
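
The pricing concern can be made concrete with a little arithmetic. The sketch below assumes whole-hour billing per node, as the comment describes. The helper names `billed_node_hours` and `utilization` are made up for illustration, and no actual price is assumed.

```python
import math

def billed_node_hours(nodes, minutes_used):
    """Node-hours charged when every started hour is billed in full."""
    return nodes * int(math.ceil(minutes_used / 60.0))

def utilization(nodes, minutes_used):
    """Fraction of the billed time that is actually spent running tests."""
    used_node_hours = nodes * minutes_used / 60.0
    return used_node_hours / billed_node_hours(nodes, minutes_used)

# The two cluster sizes mentioned above, each used for 5 minutes.
for nodes in (20, 100):
    print(nodes, billed_node_hours(nodes, 5), round(utilization(nodes, 5), 3))
```

Under that assumption a 5-minute run pays for a full hour on every node, so only about 8% of the billed time does useful work.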

  • Peter Skomoroch · 2 years ago

    Ralph,


    Good catch. Thanks for pointing that out. I just lifted those passwordless ssh lines straight from an MPI tutorial.


    This might solve the clobbering as well (from http://www.maclife.com/forums/topic/61520):


    cat id_rsa.pub >> .ssh/authorized_keys


    "The above command will create the "authorized_keys" file in the ".ssh" directory if that file doesn't already exist, and it will append the new id_rsa.pub file to it if it does already exist."
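
The same append-instead-of-overwrite idea can be sketched in Python. This is only an illustration, not the code that went into the scripts: the function name `append_authorized_key` is hypothetical, and the duplicate check and the 0600 permission step are additions that the `cat` one-liner does not do.

```python
import os

def append_authorized_key(pubkey_path, ssh_dir):
    """Append a public key to ssh_dir/authorized_keys unless it is already there.

    Returns True if the key was added, False if it was already present.
    """
    with open(pubkey_path) as f:
        key = f.read().strip()

    if not os.path.isdir(ssh_dir):
        os.makedirs(ssh_dir, 0o700)

    auth_path = os.path.join(ssh_dir, "authorized_keys")
    content = ""
    if os.path.exists(auth_path):
        with open(auth_path) as f:
            content = f.read()

    if key in (line.strip() for line in content.splitlines()):
        return False  # already authorized, nothing to do

    with open(auth_path, "a") as f:
        # Keep existing entries intact even if the file lacks a final newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(key + "\n")
    os.chmod(auth_path, 0o600)
    return True
```

Because the file is opened in append mode, keys that are already listed are left alone, which avoids the clobbering problem described above.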


    I'll add that change to the scripts. Good luck with the regression cluster, I heard Oracle developers do something like that using Condor on otherwise idle desktops (see http://www.cs.wisc.edu/condor/doc/nmi-lisa2006-slides.pdf).


    -Pete

  • Ralph Giles · 2 years ago

    Yeah, that would work better. Some more detailed comments:


    <ul>
    <li>

    Your image has /home/lamuser/.mpd.conf owned by root. I had to chown it to lamuser before I could start mpd.

    </li>
    <li>

    Your script passes the public DNS names for the nodes into mpd.hosts. For that to work, a hole has to be opened in the firewall for the ports the MPI daemon is using. A simpler solution is to just pass the internal DNS names. Then all the traffic happens behind the firewall, which probably also improves latency. (Although my ringtest was noticeably slower than yours, averaging 2.2e-3 seconds/loop, so who knows?)

    </li>
    <li>

    I was surprised that when I originally ran ec2-add-keypair in the EC2 tutorial that it uploaded the public key (ok) and printed out the private key (ok I guess) but didn't print out the public key locally (weird). Your scripts seem to assume the public key is available as id_rsa.pub on the client machine. Shouldn't this first be copied either from /root/.ssh/authorized_keys on the master node (as installed by amazon) or retrieved through the query interface?

    </li>
    </ul>
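
The internal-names suggestion from the list above could look roughly like this. It is a sketch under assumptions: the `nodes` records with `public_dns` and `private_dns` keys are a made-up shape, not what the scripts actually use, and both function names are hypothetical.

```python
def internal_hosts(nodes):
    """Pick the internal (private) DNS name of every node, master first."""
    return [node["private_dns"] for node in nodes]

def write_hosts_file(hostnames, path):
    """Write an mpd.hosts style file: one hostname per line."""
    with open(path, "w") as f:
        f.write("\n".join(hostnames) + "\n")
```

A caller would pass the node list in master-first order, for example `write_hosts_file(internal_hosts(nodes), "mpd.hosts")`, so that MPI traffic stays on the internal network.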

    Is the mutual ssh access required for more than just launching the MPI daemon? If all subsequent traffic goes through the mpi daemons, starting mpd from the client machine, or automatically from the init scripts after pulling mpd.hosts from S3 would save the whole trouble, including uploading the private key at all.

  • Peter Skomoroch · 2 years ago

    Ralph,


    More good points. I've been tied up with some other projects, but it sounds like enough feedback is in to make a revised version of the image and scripts. I expect the latency to vary a bit depending on the random EC2 network topology when a cluster is launched...(instances on the same box vs. over ethernet) that might explain the ringtest. The mutual ssh access was set up since we do a lot of file/data shuffling between nodes outside of MPI.


    Thanks again, looking forward to hearing how the regression test system works out.


    -Pete

  • Peter Skomoroch · 2 years ago

    Update (7-24-07): I’ve made some important bug fixes to the scripts to address issues mentioned in the comments.


    Specific changes made:


    <ul>
    <li>fixed lamuser home directory permissions bug</li>
    <li>fixed section of ec2-mpi-config.py which clobbered existing rsa keys on the client machine</li>
    <li>Updated calls of AWS python EC2 library to use API version 2007-01-19
    http://developer.amazonwebservices.com/connect/...</li>
    <li>fixed mpdboot issue by using amazon internal DNS names in hosts files</li>
    <li>scripts should now work on windows/cygwin client environments</li>
    </ul>

    After I run some benchmarks, I'm hoping to find some time to add LAM and OpenMPI to the EC2 image along with NFS configuration, C3 cluster tools, Ganglia, and a benchmarking package.

  • Soo.. · 2 years ago

    What about that Part 3? :)

  • Patrick Ball · 2 years ago


    the first two parts really set the stage ... Part 3?


    :)

  • Theo · 2 years ago

    Does the 5 month hiatus in this project mean that it was a bad idea and you guys have learnt enough to waste no more time on it?


    Given the virtualization uncertainty, finding the right communication/computation balance for typical MPI programs appears to be very unrewarding. Secondly, MPI development and debug and then QA and scale out are not addressed, which doesn't bode well. It appears most productive to have a local small cluster for development and debug, and then do QA and scale out on EC2, but some benchmarking numbers would really help.


    If EC2 is only robust for embarrassingly parallel problems, then MapReduce style programs are more attractive. There the size of the data set and how well it integrates in a distributed file system appear to be the problems to focus on. Or BOINC like approaches if there is no integrated DFS. Anyone have operational data on these approaches?

  • Peter Skomoroch · 1 year ago

    Theo,


    Sorry for the delay in posting this and responding. I've been working on a startup for the past 7 months and was in serious crunch mode. Don't read too much into the large gap in posts, it is just me working on this as a side-project. I finished moving the blog to another host and finally have some time to get back to the EC2 work. This experience has taught me to never name a series of blog posts "part 1 of N" :)


    You make some excellent points. One thing that has changed since I wrote the first post is that EC2 now offers larger 64bit machine images with better I/O (you can provision an entire physical server and not be limited by sharing network resources in the virtual instance). I'd like to see if this improves the network performance. I'm giving a talk on this in March, so I'm on the hook to have some benchmarks by then.


    I also agree on the mapreduce side. For embarrassingly parallel problems, hadoop on ec2 is potentially much more attractive...more robust, easier for most people to program. Ideally, I would like to do some comparisons between the two approaches and run the numbers.


    The performance of an EC2 MPI cluster is definitely going to be worse than your own custom hardware, but it still might fit certain niche situations. In my case, I needed to run some MPI code for a large problem and didn't have access to a large enough cluster. The performance on EC2 was nowhere near what you get on a high-end cluster, but it got the job done for a reasonable price.


    This discussion on the beowulf list goes into more detail on the pros/cons:


    http://www.beowulf.org/pipermail/beowulf/2008-January/020490.html


    -Pete

  • pete · 1 year ago

    Can't get the ec-mpi-config to work. Says list index out of range for mpi-externalnames[0] on line 108.
    The start-cluster and check-instances scripts run OK, so I think that Python, EC2 and elementtree
    are OK.
    Any ideas why? Has AWS changed the format of the response you're parsing? (Yes, I have had a look at the Python code, but since I haven't used Python before I can't see anything obvious.)
    BTW you have a typo in mpi-config: "Numer of nodes" as opposed to "Number of nodes"; it even shows in your example above.
    Otherwise I like what you've done, I'd just like it to work for me.
    Thanks,
    Pete

  • Peter Skomoroch · 1 year ago

    pete found the error... the image IDs he entered into the config module inadvertently contained a capital letter. This doesn’t cause any problems for starting images since string case is ignored by Amazon. The corresponding image ID response string from AWS is always lowercase, so the Python script comparison on the image ID string fails.


    In the next version of the scripts, I will handle upper/lowercase differences in the AMI strings. For now, just make sure to use all lower case or call the Python .lower() method:


    <pre lang="python">
    >>> test = 'ami-fE9a7f97'
    >>> test.lower()
    'ami-fe9a7f97'
    >>>

    </pre>
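
One way the case handling might be done is to normalize both sides before comparing. This is a sketch only: `same_ami` and `instances_for_image` are hypothetical names, and the `(instance_id, image_id)` pairs are an assumed shape, not what the scripts parse.

```python
def same_ami(configured_id, reported_id):
    """Compare two AMI IDs, ignoring case and surrounding whitespace."""
    return configured_id.strip().lower() == reported_id.strip().lower()

def instances_for_image(instances, image_id):
    """Return the instance IDs whose image matches image_id.

    `instances` is a list of (instance_id, image_id) pairs (assumed shape).
    """
    return [iid for iid, ami in instances if same_ami(ami, image_id)]
```

With this, an ID typed as 'ami-fE9a7f97' in the config still matches the lowercase ID that comes back in the response.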
  • pete · 1 year ago

    Found another typo too (OK, I'm nitpicking): in the stop-cluster script the message says "Stoping" as opposed to "Stopping". A year ago when you first posted this stuff you mentioned that the reason why the non-root user was called lamuser was that the scripts were used for LAM in some previous incarnation. I'm actually trying to use LAM, so if you have any LAM stuff around, it might help me iron out one or two problems I still have.


    Anyway, thanks again,
    Pete

  • Peter Skomoroch · 1 year ago

    No problem, thanks for finding the typos. These were meant to be some quick hacks, but took on a life of their own after a while.


    I found this worked for configuring LAM, I'll send you more details in an email...


    The contents of .bash_profile should be as follows:


    <pre lang="code">
    -bash-3.1# more .bash_profile
    # .bash_profile

    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
        . ~/.bashrc
    fi

    # User specific environment and startup programs

    LAMRSH="ssh -x"
    export LAMRSH

    LD_LIBRARY_PATH="/usr/local/lam-7.1.2/lib/"
    export LD_LIBRARY_PATH

    MPICH_PORT_RANGE="2000:8000"
    export MPICH_PORT_RANGE

    PATH=$PATH:$HOME/bin

    PATH=/usr/local/lam-7.1.2/bin:$PATH

    MANPATH=/usr/local/lam-7.1.2/man:$MANPATH

    export PATH
    export MANPATH
    </pre>

    Launch the cluster on EC2 and try booting LAM manually:


    <pre lang="code">
    [lamuser@domU-12-31-33-00-04-4B ~]$ lamboot /etc/mpd.hosts

    [lamuser@domU-12-31-33-00-04-4B ~]$ lamnodes
    n0 domU-12-31-33-00-04-4B.usma1.compute.amazonaws.com:1:origin,this_node
    n1 domU-12-31-33-00-03-35.usma1.compute.amazonaws.com:1:
    n2 domU-12-31-33-00-03-3C.usma1.compute.amazonaws.com:1:
    n3 domU-12-31-34-00-00-55.usma2.compute.amazonaws.com:1:

    [lamuser@domU-12-31-33-00-04-4B ~]$ tping N -c3
    1 byte from 3 remote nodes and 1 local node: 0.039 secs
    1 byte from 3 remote nodes and 1 local node: 0.004 secs
    1 byte from 3 remote nodes and 1 local node: 0.002 secs
    </pre>
  • raghav · 1 year ago

    Why does it ask me for a password when I try to run the ec2-mpi-config.py file?
    It says root@xxx password:
    And I get a lot of text on the terminal when I try running the file.

  • Peter Skomoroch · 1 year ago

    raghav,


    I assume you were able to start the instances with ec2-start-cluster.py? The text on the terminal is normal, but it shouldn't ask you for a password (I should probably add a verbose option instead of streaming out text by default). There was a path issue on windows with an earlier version of the scripts, so that may be the problem.


    If you send me the script version number from the README and/or terminal output, I can try to track down what is going on...


    peter.skomoroch@gmail.com


    -Pete

  • Peter Skomoroch · 1 year ago

    raghav,


    Another suggestion is to make sure the instances are running with ./ec2-check-instances.py and then retry the script; sometimes it takes a while for sshd to start up on EC2.


    -Pete

  • raghav · 1 year ago

    Hey guys,
    Actually I made a change in the ec2-mpi-cluster.py file. I have no clue about Python and I don't know why it worked, but it worked.


    I modified:


    template = 'ssh -o "StrictHostKeyChecking no" %(user)s@%(host)s "%(cmd)s"'
    to
    template = 'ssh -i "/home/id_rsa-gsg-keypair" %(user)s@%(host)s "%(cmd)s"'


    and


    template = '%(cmd)s %(switches)s -o "StrictHostKeyChecking no" %(src)s %(user)s@%(host)s:%(dest)s'
    to
    template = '%(cmd)s %(switches)s -i "/home/id_rsa-gsg-keypair" %(src)s %(user)s@%(host)s:%(dest)s'


    And it started working perfectly fine. I was able to log in to the master node and the pi problem executed perfectly fine.
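
Rather than hard-coding the key path into each template, the identity file could be filled in from the KEY_LOCATION setting. The following is only a sketch of that idea, not the scripts' actual code: `ssh_command`, `scp_command` and the `key` placeholder are hypothetical. It also keeps the StrictHostKeyChecking option next to `-i`, which the change above dropped.

```python
import os

SSH_TEMPLATE = ('ssh -i "%(key)s" -o "StrictHostKeyChecking no" '
                '%(user)s@%(host)s "%(cmd)s"')
SCP_TEMPLATE = ('scp %(switches)s -i "%(key)s" -o "StrictHostKeyChecking no" '
                '%(src)s %(user)s@%(host)s:%(dest)s')

def ssh_command(key_location, user, host, cmd):
    """Build an ssh command line that names the keypair file explicitly."""
    key = os.path.expanduser(key_location)  # expands a leading "~"
    return SSH_TEMPLATE % {"key": key, "user": user, "host": host, "cmd": cmd}

def scp_command(key_location, src, user, host, dest, switches=""):
    """Build an scp command line that uses the same identity file."""
    key = os.path.expanduser(key_location)
    return SCP_TEMPLATE % {"key": key, "switches": switches, "src": src,
                           "user": user, "host": host, "dest": dest}
```

Passing `-i` explicitly means ssh no longer depends on the key being installed as the default `~/.ssh/id_rsa`, which may be why forcing the switch removed the password prompt.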


    Thanks a lot guys


    Cheers,
    Raghav

  • raghav · 1 year ago

    Thanks Pete, for your prompt reply!!

  • Kurt Grandis · 1 year ago

    Thanks Pete. I wish I had made the PyCon session, but these posts have been very helpful. The cluster went up pretty quickly and I have already used it to crunch a few minor data runs.
    In setting everything up I also ran into a similar problem as Raghav and ended up solving it in a similar manner by forcing the -i credentials switch. I imagine it has something to do with the way I configured and placed my certs.

  • raghav · 1 year ago

    I am trying to compile a simple C MPI file "hellompi.c" using the command:



    mpicc -o /usr/hellompi /usr/local/src/hellompi.c



    Why does it give me the following error?


    /usr/bin/ld: cannot open output file /usr/hellompi: Permission denied
    collect2: ld returned 1 exit status


    How do I get root privileges?

  • Peter Skomoroch · 1 year ago

    Raghav,


    You can ssh in as root instead of lamuser, or compile the output file into your home directory.


    Check out the new AMI and management code:


    http://www.datawrangling.com/pycon-2008-elasticwulf-slides.html


    The new AMI includes a preconfigured NFS mounted directory /home/beowulf. If you compiled the file there, hellompi would be available on all nodes.


    Note that the new images default to the 'large' instance type, which charges $0.40/hour for each node.


    -Pete

  • Patrick · 1 year ago

    Peter,


    Very useful tool! I've gotten a cluster up and running using the small instance type but am having difficulty launching the _64 AMIs.


    $ ./ec2-start-cluster.py
    m1.large
    image ami-eb13f682
    master image ami-e813f681
    ----- starting master -----
    Traceback (most recent call last):
      File "./ec2-start-cluster.py", line 39, in ?
        master_response = conn.run_instances(imageId=MASTER_IMAGE_ID, minCount=1, maxCount=1, keyName= KEYNAME, instanceType=INSTANCE_TYPE )
    TypeError: run_instances() got an unexpected keyword argument 'instanceType'


    If I try to start the cluster without passing an INSTANCE_TYPE arg I get the following:
    $ ./ec2-start-cluster.py
    m1.large
    image ami-eb13f682
    master image ami-e813f681
    ----- starting master -----
    InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-e813f681 (x86_64)
    ----- starting workers -----
    InvalidParameterValue: The requested instance type's architecture (i386) does not match the architecture in the manifest for ami-eb13f682 (x86_64)


    Any ideas? Thanks!

  • Peter Skomoroch · 1 year ago

    Patrick,


    Did you start with a clean install of the 64 bit scripts? I made some changes to EC2.py in the new scripts to handle the new instance types...

  • Theo · 1 year ago

    Peter:


    I am diving into Hadoop with Map/Reduce as we speak. As you know Google implemented its environment in C++, so I was a bit disappointed that Hadoop had chosen the Java VM to do its bidding. Java makes interfacing with hardcore numerical operations much harder. The particular problems I am looking at are large scale Lanczos solvers to find eigenvalues/eigenvectors of large systems of equations. These systems are of interest in advertising, quantitative finance, and sensor networks. The problem is that they are all environments in which latency is of the essence. So you have a capacity component in terms of the size of the system and a latency issue in terms of the data rate coming in and the opportunity cost for somebody to get to the answer faster.


    I would be interested in working on this particular benchmark problem: pick a big eigenvalue/eigenvector problem and solve it on a cluster, EC2, and via Hadoop/Map-reduce. Clearly this is going to be a lot of work, so it should be worth publishing. I am sure many folks would be interested in this experiment, so let me know if this is something that you could invest time in.


    Theo

  • Patrick · 1 year ago

    Thanks, Peter. The original EC2.py was the problem. I now have the large AMIs up and running. Thanks again for the article and help!


    Patrick

  • Peter Skomoroch · 1 year ago

    I found the secret to avoiding a lot of MPI errors on EC2, but haven't found time to do an additional post...


    The secret seems to be that just because Amazon says that an instance is "running", doesn't mean that the ssh daemons are available. This caused all kinds of intermittent problems setting up the hosts and my old scripts would fail silently.


    In my current codebase, I do some checks like the following:


    <pre lang="python">
    import commands  # Python 2 standard library module
    import time

    # NOTE: excerpt from inside a build function, hence the bare `return` at the end.
    print "Instance is %s" % BOOTING_INSTANCE

    # wait for instance description to return "running" and grab HOSTNAME variable
    print "Polling server status (ec2-describe-instances %s)" % BOOTING_INSTANCE
    while 1:
        print "waiting for instance to boot..."
        HOSTNAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $4}'" % BOOTING_INSTANCE)
        if len(HOSTNAME) > 1:
            print "-------Instance booted, The server is available at %s" % HOSTNAME
            DOM_NAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $5}'" % BOOTING_INSTANCE).split('.')[0]
            break
        time.sleep(1)

    # sometimes it takes a while for the ssh service to start, even when the ec2 api describes an instance as running.
    # A machine in the "running" state may not have finished booting. Try executing a no-op command until a valid response is found
    print "verifying ssh daemon has started..."
    ec2_launch_failed = False  # set to True below if sshd never answers
    counter = 0
    while 1:
        print "Waiting for ssh daemon to start..."
        counter += 1
        REPLY = commands.getoutput('''ssh %s "root@%s" 'echo "hello"' ''' % (SSH_OPTS, HOSTNAME) )
        if REPLY == 'hello':
            print "-------ssh has started, proceeding with AMI build"
            break
        if counter > 24:
            print "Instance not responding to SSH hails, aborting..."
            ## sshd should not take more than 2 minutes to launch
            terminate_status = commands.getoutput('ec2-terminate-instances %s' % BOOTING_INSTANCE)
            ec2_launch_failed = True
            print "Base Instance terminated"
            break
        time.sleep(5)

    if ec2_launch_failed:
        print "Aborting build"
        return


    </pre>
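
For readers on Python 3, the same "poll until sshd answers" idea might be written as below. This is a sketch under assumptions, not the author's codebase: `wait_until` and `ssh_is_up` are hypothetical names, the ssh options shown are one plausible choice, and the 24 tries at 5 seconds each are carried over from the snippet above.

```python
import subprocess
import time

def wait_until(check, attempts=24, delay=5.0, sleep=time.sleep):
    """Call check() until it returns True; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay)  # pause only between tries, not after the last one
    return False

def ssh_is_up(host, user="root"):
    """True if a no-op command run over ssh echoes back the expected text."""
    cmd = ["ssh",
           "-o", "BatchMode=yes",            # fail instead of asking for a password
           "-o", "StrictHostKeyChecking=no",
           "-o", "ConnectTimeout=5",
           "%s@%s" % (user, host), "echo hello"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and result.stdout.strip() == "hello"
```

A caller would run something like `wait_until(lambda: ssh_is_up(hostname))` after the instance reports "running", and terminate the instance if it returns False.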
  • Peter Skomoroch · 1 year ago

    @Theo,


    I'm attending MMDS this week at Stanford (http://www.stanford.edu/group/mmds/), and had a chance to ask James Demmel a few questions. He gave a talk titled "Avoiding communication in linear algebra algorithms", which was very relevant. His advice for matrix multiplication in a high latency environment like EC2 was to try dialing up the block size as much as possible in the standard MPI solvers and see how performance was affected.


    -Pete

  • magg · 1 year ago

    Hi Peter,


    Have you tried to connect EC2 instances with your local desktops? I am trying to do that with mpich2 1.0.7 but I am not successful at all. mpdboot complains about invalid port info (no_port) - actually no port when I try to do mpdboot -n 2. Even when I tried to mpd& on EC2 machine and then mpdtrace -l and then unblock the port and then mpd -h ec2-blabla -p ec2-mpdtrace-l-port still I have no luck. Have you faced similar problems?


    Thanks
    - magg

  • Peter Skomoroch · 1 year ago

    Magg,


    I wouldn't recommend it, the latency would be huge and I'm not sure how MPI would handle that. You would also need to open the mpi ports to the outside world using the EC2 security group authorize commands.


    An alternative is to open an X11 session and connect to the head node or maybe VNC in to the instance. The 64 bit elasticwulf images are set up for X11 sessions and adding a desktop package would allow you to VNC in if you prefer that route.


    -Pete

  • Abhimanyu · 6 months ago
    I am facing the same problem. I am able to set up the ring manually but mpdboot complains:

    mpdboot_domU-blah(handle_mpd_output 414): from mpd on domU-blah-,
    invalid port info:
    no_port

    any word on this?
  • pskomoroch · 6 months ago
    I think there are some SGE grid solutions now that allow you to add EC2 nodes to your existing cluster, but again the performance of MPI from a local network to EC2 would be horrible...

    If you are interested in running MPI on EC2, I have a new project on Github I'll be announcing soon:

    http://github.com/datawrangling/ec2cluster/tree...
  • Abhimanyu · 6 months ago
I found the problem. Apparently I had forgotten to include the "chown -R user:user /home/user" command. It didn't have access to the id_rsa file. As root, mpdboot would work. Rather silly error message though.
  • Tim Salimans · 1 year ago

    Great project and thanks very much for sharing! I do have some trouble getting it all to work though. Everything works fine until it tries to run the create_hosts.py:


    /////// OUTPUT ///////////////


    Creating hosts file on master node and copying hosts file to compute nodes...


    pscp -scp -i D:\grid\keys\keypair.ppk -q create_hosts.py root@ec2-67-202-19-253.compute-1.amazonaws.com:/etc/


    plink -ssh -i D:\grid\keys\keypair.ppk root@ec2-67-202-19-253.compute-1.amazonaws.com "python /etc/create_hosts.py"


    exporting 10.252.31.48:/home/beowulf
    exporting 10.252.31.48:/mnt/data
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @ WARNING: UNPROTECTED PRIVATE KEY FILE! @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    Permissions 0644 for '/root/.ssh/id_rsa' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    bad permissions: ignore key: /root/.ssh/id_rsa
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (publickey,gssapi-with-mic,password).
    lost connection
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @ WARNING: UNPROTECTED PRIVATE KEY FILE! @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    Permissions 0644 for '/root/.ssh/id_rsa' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    bad permissions: ignore key: /root/.ssh/id_rsa
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (publickey,gssapi-with-mic,password).
    lost connection


    etcetera


    //////////////////////////////////


    As you can see I made some small modifications in order to use PuTTY as my SSH client, but that does not seem to be the problem... Does anyone else have this problem, and does anyone know how to fix it?

  • Tim Salimans · 1 year ago

    Got it working using OpenSSH; guess PuTTY was the problem after all.
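
The "UNPROTECTED PRIVATE KEY FILE" warning in the output above is ssh refusing to use a key file that group or other users can access (mode 0644). A check for that condition could be sketched as follows. The function names are hypothetical, and this only illustrates the permission test on a POSIX system; it is not a fix for the PuTTY setup.

```python
import os
import stat

def key_too_open(path):
    """True if group or other users have any access to the key file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))

def restrict_key(path):
    """Make the key readable and writable by its owner only (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

Running `restrict_key` on the copied private key before the first ssh call between nodes would avoid the "bad permissions: ignore key" failure shown above.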

  • jjiyunlee · 1 year ago

    Hi,


    Thanks for your writeup! It's very helpful. I'm running into an error with mpdtrace and was hoping for some of your insight into it. I am running mpd as root, with one node for simplicity.


    I can successfully start up mpd on the instance and "mpd &":
    root@...:/etc# mpdboot -n 1 -f mpd.hosts
    root@...:/etc# mpd &
    [1] 2280


    but "mpdtrace -l" gives me an error:
    root@ip-10-251-143-0:/etc# mpdtrace -l
    mpdtrace: unexpected msg from mpd=:{'error_msg': 'invalid secretword to root mpd'}:


    I have tried all pairwise combinations of having MPD_SECRETWORD= or secretword= in ~/.mpd.conf and /etc/mpd.conf, all of which were set to read/write for root only.


    I also can't do "mpdallexit":
    root@...:~# mpdallexit
    mpdallexit: mpd_uncaught_except_tb handling:
    : 'cmd'
    /usr/local/bin/mpich2-install/bin/mpdallexit 53 mpdallexit
    elif msg['cmd'] != 'mpdallexit_ack':
    /usr/local/bin/mpich2-install/bin/mpdallexit 59
    mpdallexit()


    I can also run mpdcheck as a server and have it listen for mpdcheck as a client from the same instance (in a different window).


    Suggestions/help? I'd greatly appreciate any advice you have on this problem. Thanks --


    - Joanne
  • Peter Skomoroch · 1 year ago

    Joanne,


    Try logging in and running your commands as "lamuser" instead of root. The default configuration assumes lamuser is running all commands.


    $ ssh lamuser@ec2-72-44-46-78.z-2.compute-1.amazonaws.com


    See part 1 of the post for details on changing the configuration to run MPI as root.


    -Pete

  • ej · 1 year ago

    @Pete - Feb08



    > Can’t get the ec-mpi-config to work. Says list index out of range for mpi-externalnames[0] on line 108



    You are right - the output from ec2-describe-instances has changed. Do the following..


    Change


    machine_state.append(chunk[-1])


    to


    machine_state.append(chunk[5])


    in "ec2-mpi-config.py"


    Or, if the output changes again - just do an "ec2-describe-instances" and match up the required fields to the index on the chunk[] array
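
One way to make the script less sensitive to column shifts is to look for the state by value instead of by position. This is a sketch under assumptions: the list of state names and the sample line are illustrative (the line is assembled from values seen in this thread, not captured output), and `find_state` is a hypothetical helper.

```python
# Assumed set of instance state names; adjust to whatever the tool prints.
KNOWN_STATES = ("pending", "running", "shutting-down", "terminated")

def find_state(chunk):
    """Return the first field that looks like an instance state, else None."""
    for field in chunk:
        if field in KNOWN_STATES:
            return field
    return None

# Illustrative line only, split on whitespace the same way chunk[] is built.
sample = ("INSTANCE i-e39c7a8a ami-3e836657 "
          "ec2-72-44-45-138.z-2.compute-1.amazonaws.com "
          "domU-12-31-33-00-04-4B.usma1.compute.amazonaws.com "
          "running gsg-keypair")
print(find_state(sample.split()))
```

The trade-off is that a field which happens to equal a state name (for example a keypair called "running") would be misread, so a fixed index is still simpler when the format is stable.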

  • Ben Racine · 11 months ago

    Hello,


    I get the same problem that Michael Creel was having. I am able to start the instances and get them "running" successfully, by pointing them to my keypair with the KEYNAME variable, but I believe my KEY_LOCATION variable in my EC2config.py file must be causing the prompt for a password.


    This is all per the default block of code in EC2config.py:



    # change this to your keypair location (see the EC2 getting started guide tutorial on using ec2-add-keypair)

    KEYNAME = "my_keypair"
    KEY_LOCATION = "/Users/pskomoroch/id_rsa-gsg-keypair"


    I believe this requires me to go back through the "getting started guide", but I just wanted to update my progress in case others are seeing the same thing.


    Many thanks for sharing your progress Peter!


    Ben Racine

  • Jeff Howbert · 8 months ago

    Hello Pete -


    Thanks for putting together your ElasticWulf scripts and AMIs. They have saved me a huge amount of time and effort compared with building my own from scratch.


    I am interested in parallelizing some machine learning algorithms written in R. My interest in ElasticWulf comes partly from the fact that R is already bundled with its AMIs. I discovered, however, that Rmpi is not one of the installed packages. What were your intentions/plans with R in the ElasticWulf environment? Did you plan for parallel communication using a mechanism other than MPI?


    It wasn't hard to install Rmpi on top of the ElasticWulf AMI, but despite a couple of days' struggle, I haven't found a combination of Rmpi version and paths to the AMI's existing MPI libraries that fully works. The best I've been able to do is spawn an R cluster where all the nodes are running on the master node.


    Could you tell me what version of R and the various MPI implementations (OpenMPI, MPICH, LAM) went into your 64-bit AMI? That might help me sort things out. A couple of observations, for what they're worth:


    1) Once I have an ElasticWulf cluster up and have run mpdboot, I find that mpiexec works, but orterun (the equivalent in OpenMPI) does not.


    2) There has been at least one report of problems between the latest version of Rmpi and OpenMPI:


    https://stat.ethz.ch/pipermail/r-sig-hpc/2009-February/000105.html


    Much thanks.


    Jeff Howbert

  • Pete · 8 months ago

    Jeff,


    I'm actually working on a new version of Elasticwulf right now. Shoot me an email at pete@datawrangling.com and I'll try to include what you need for Rmpi. If you have some sample Rmpi code you want to test and that you don't mind releasing, we can build that into the AMI to ensure everything you need is installed.


    Here are the MPI installs that were included on that Fedora 64 bit image:


    <pre>
    # mpich2

    cd /usr/local/src/
    wget http://www.mcs.anl.gov/research/projects/mpich2...
    tar -xzvf mpich2-1.0.6p1.tar.gz
    cd mpich2-1.0.6p1
    ./configure --enable-sharedlibs=gcc --prefix=/usr/local/mpich2
    make
    make install

    # openmpi

    cd /usr/local/src/
    wget http://www.open-mpi.org/software/ompi/v1.2/down...
    #wget http://www.open-mpi.de/software/ompi/v1.2/downl...
    tar -zxf openmpi-1.2.5.tar.gz
    cd openmpi-1.2.5
    ./configure --prefix=/usr/local/openmpi
    make all
    make install

    # lam
    cd /usr/local/src/
    wget http://www.lam-mpi.org/download/files/lam-7.1.2...
    tar -xzvf lam-7.1.2.tar.gz
    cd lam-7.1.2
    ./configure --enable-shared --prefix=/usr/local/lam
    make
    make install

    # mpich1
    cd /usr/local/src/
    wget http://www-unix.mcs.anl.gov/mpi/mpich1/download...
    tar -zxf mpich.tar.gz
    cd mpich-1.2.7p1/
    ./configure --enable-sharedlib --prefix=/usr/local/mpich
    make
    make install
    </pre>
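
    The four builds follow the same unpack, configure, make, install pattern, so they can be summarised as a table. This is only a sketch: the tarball names, source directories, configure flags and make targets are copied from the listing, the download step is left out because the URLs are truncated in the post, and the names BUILDS and build_commands are made up for illustration.

```python
# (tarball, source directory, configure flags, make command) per MPI install.
BUILDS = [
    ("mpich2-1.0.6p1.tar.gz", "mpich2-1.0.6p1",
     "--enable-sharedlibs=gcc --prefix=/usr/local/mpich2", "make"),
    ("openmpi-1.2.5.tar.gz", "openmpi-1.2.5",
     "--prefix=/usr/local/openmpi", "make all"),
    ("lam-7.1.2.tar.gz", "lam-7.1.2",
     "--enable-shared --prefix=/usr/local/lam", "make"),
    ("mpich.tar.gz", "mpich-1.2.7p1",
     "--enable-sharedlib --prefix=/usr/local/mpich", "make"),
]


def build_commands(tarball, src_dir, configure_flags, make_cmd):
    """Return the shell commands for one build, without the download step."""
    return [
        "cd /usr/local/src/",
        "tar -xzf %s" % tarball,
        "cd %s" % src_dir,
        "./configure %s" % configure_flags,
        make_cmd,
        "make install",
    ]


# Print the commands instead of running them, so they can be reviewed first.
for build in BUILDS:
    print("\n".join(build_commands(*build)))
    print()
```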
  • Andrew Lonie · 5 months ago
    Hi - I'd be very interested in an Elasticwulf cluster that supports R, too. Did anything come of this? I'd be happy to be involved.

    Andrew
  • pskomoroch · 5 months ago
    Yes, I have a Rails REST web service on github now for spawning MPI clusters that support R. Haven't had time yet to finish the docs or write a blog post about it. Works fine in operation...

    http://github.com/datawrangling/ec2cluster/tree...
  • Soren Macbeth · 3 months ago
    Hey Peter,

    Thanks for the awesome ec2cluster project! I forked it on github so that I could add the ability to install R packages from CRAN across all the nodes in the cluster. Basically I added the following to ubuntu_installs.sh:
    --snip--
    # Custom R packages
    cat <<EOF >> /home/ec2cluster/install_custom_packages.R
    install.packages("DEoptim",repos="http://cran.stat.ucla.edu")
    EOF

    R CMD BATCH /home/ec2cluster/install_custom_packages.R
    --snip--

    A bit crude I know, but something people wanting to do things with R will probably find useful.

    To actually run the R code, I was successful with the following approach:

    1) put your R code in a file and save it as foo.R
    2) add the following line to a shell script: mpirun -n 1 -hostfile /mnt/ec2cluster/openmpi_hostfile R CMD BATCH foo.R
    3) run the shell script and copy the resulting .Rout file to your S3 bucket!
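
    For anyone driving this from Python rather than a shell script, step 2 could be sketched as below. The helpers build_mpirun_command and run_r_batch are hypothetical; the command itself is the one from step 2. The launch only works on a cluster node where mpirun and R are installed, so the example just prints the command.

```python
import subprocess


def build_mpirun_command(r_script, hostfile, n_procs=1):
    """Build the argument list for the mpirun call from step 2."""
    return ["mpirun", "-n", str(n_procs), "-hostfile", hostfile,
            "R", "CMD", "BATCH", r_script]


def run_r_batch(r_script, hostfile, n_procs=1):
    """Launch the job; raises CalledProcessError if mpirun exits non-zero."""
    subprocess.check_call(build_mpirun_command(r_script, hostfile, n_procs))


command = build_mpirun_command("foo.R", "/mnt/ec2cluster/openmpi_hostfile")
print(" ".join(command))
```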
  • pskomoroch · 3 months ago
    Soren,

    Glad ec2cluster helped, are you guys big R users at StockTwits?

    -Pete
  • Soren Macbeth · 3 months ago
    The very first StockTwits prototype used R to generate some statistics as well as generate charts of the stocks being talked about in tweets :)
  • Andrew Lonie · 5 months ago
    Thanks - this is impressive, and a web interface for building clusters would be very nice, but maybe I'm after something slightly closer to your original solution. You might already know that R has an in-language cluster support API built on Rmpi, called SNOW (Simple network of workstations). It allows for various script commands like clusterExport(data) and clusterApply(vector, function) which let you interactively cluster jobs according to the parameter values in a list.
    Would this be compatible with your cluster app? I notice that it's more schedule-focused and the nodes need to talk to the app; is the app acting as the master rather than one of the ec2 nodes? Ideally my architecture would be a master Rmpi node running on ec2, talking to an arbitrary number of slave nodes and accessed through something like the Biocep remote R client (http://biocep-distrib.r-forge.r-project.org/), with clustering done in-session.
  • pskomoroch · 5 months ago
    Yes, this is compatible with the cluster app. One of the bundled examples runs some calculations with R and SNOW, another uses Rmpi (see the code on Github http://bit.ly/ocKCn ).

    The web interface can be run anywhere, but needs to be https accessible to the EC2 nodes. I usually just run it as a small ec2 instance as shown in the docs. You can start a job from the API or the web console with shutdown_after_complete = false, and the cluster will remain live for interactive work, just ssh into the master node like you would with Elasticwulf. The app is not acting as the MPI master node, but the cluster nodes do talk to the app to handle configuration etc.
  • Andrew Lonie · 5 months ago
    OK thanks I understand. I'll try this out properly; it sounds like exactly what I'm after.
  • Harry · 4 months ago
    Hello everybody,

    I got an error with the Fortran 77 libraries on AMI ami-e813f681 (Fedora Core 6, x86-64): configure cannot find libf2c or libg2c on the AMI.
    (....
    checking for f_exit in -lf2c... no
    checking for f_exit in -lg2c... no
    checking for dummy main to link with Fortran 77 libraries... unknown
    configure: error: linking to Fortran libraries from C fails
    )

    Could anybody help me solve the problem? Thanks so much!

    best regards,

    Harry