<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Data Wrangling Blog - Latest Comments in Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://datawranglingblog.disqus.com/</link><description></description><atom:link href="https://datawranglingblog.disqus.com/amazon_elastic_mapreduce_a_web_service_api_for_hadoop/latest.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Sat, 11 Apr 2009 20:28:17 -0000</lastBuildDate><item><title>Re: Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop#comment-11078499</link><description>&lt;p&gt;Peter T,&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Good question; I think this is something Amazon needs to sort out in general.  For Elastic MapReduce (EMR) it is easier to work with chunked S3 files at the moment, so I just run a script to copy the raw data up to S3 in chunks using s3cmd.  Outside of EMR I usually use single EBS volumes for datasets &amp;lt; 1 TB, since they can be directly mounted on EC2 and are easier to work with.
I haven't tried the EBS + EMR hack yet, but will let you know how that goes if I try it.&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Contact me with the exact data sizes you are dealing with and maybe I can suggest an approach...&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;-Pete&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Peter Skomoroch</dc:creator><pubDate>Sat, 11 Apr 2009 20:28:17 -0000</pubDate></item><item><title>Re: Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop#comment-11078498</link><description>&lt;p&gt;Peter,&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Great stuff - thanks very much for posting this.&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;We have 100 GB to multi-TB input files required for various non-Java analysis applications.  Would you recommend we use the same approach so beautifully presented above and in the Python tutorial - or do you recommend we take a different approach to making the input data available?&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;Peter&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Peter Tonellato</dc:creator><pubDate>Sat, 11 Apr 2009 19:55:13 -0000</pubDate></item><item><title>Re: Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop#comment-11078497</link><description>&lt;p&gt;Here's an idea for a streaming hack I mentioned on the forums. There is no built-in support for pulling data from EBS volumes instead of S3 and boto isn't installed, but you could use the distributed cache to install boto, mount an EBS public dataset from within the first mapper, and load it onto HDFS.
Something like this:&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;1) Send a zipped boto src directory, renamed with a .mod extension, over the distributed cache&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;2) Add an initial dummy JobFlow step where the mapper opens boto.mod, reads in your credentials, imports boto, and mounts the EBS snapshot using your AWS credentials&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;br&gt;&lt;p&gt;import zipimport&lt;br&gt;importer = zipimport.zipimporter('boto.mod')&lt;br&gt;boto = importer.load_module('boto')&lt;/p&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;mount public dataset via boto:&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;ec2-attach-volume vol-e84aae81 -i i-a09104c9 -d /dev/sdf&lt;br&gt;mkdir /mnt/genbank&lt;br&gt;mount /dev/sdf /mnt/genbank&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;3) After the volume is mounted, shell out within the mapper and copy the data from the EBS volume to HDFS:&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;hadoop fs -put /mnt/genbank /home/hadoop/genbank&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;4) Proceed with the next JobFlow step, using hdfs:///home/hadoop/genbank as input&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Peter Skomoroch</dc:creator><pubDate>Fri, 03 Apr 2009 21:06:32 -0000</pubDate></item><item><title>Re: Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop#comment-11078496</link><description>&lt;p&gt;Thanks Steve, I've been following the Mahout mailing list for a while and look forward to seeing more machine learning algorithms implemented in MapReduce.
It is a young project, but I'm interested to see how it develops.&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;This post was extremely short since I wanted to get the announcement out quickly; I'll have more to post about machine learning with Hadoop soon. Until then, check out &lt;a href="http://delicious.com/pskomoroch/hadoop+machinelearning" rel="nofollow noopener" target="_blank" title="http://delicious.com/pskomoroch/hadoop+machinelearning"&gt;http://delicious.com/pskomo...&lt;/a&gt; and &lt;a href="http://delicious.com/pskomoroch/mapreduce+machinelearning" rel="nofollow noopener" target="_blank" title="http://delicious.com/pskomoroch/mapreduce+machinelearning"&gt;http://delicious.com/pskomo...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Peter Skomoroch</dc:creator><pubDate>Fri, 03 Apr 2009 10:53:48 -0000</pubDate></item><item><title>Re: Amazon Elastic MapReduce: A Web Service API for Hadoop</title><link>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop#comment-11078495</link><description>&lt;p&gt;Since this blog is related to machine learning as well, I think some of your readers would be interested to know about the Mahout project [1], an Apache project whose aim is to develop various machine learning algorithms in the MapReduce paradigm, using Hadoop.&lt;/p&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;p&gt;[1] Mahout: &lt;a href="http://lucene.apache.org/mahout/" rel="nofollow noopener" target="_blank" title="http://lucene.apache.org/mahout/"&gt;http://lucene.apache.org/ma...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Steve Lianoglou</dc:creator><pubDate>Fri, 03 Apr 2009 10:00:40 -0000</pubDate></item></channel></rss>