Website: http://www.datawrangling.com/
Original page: http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop
Since this blog also covers machine learning, I think some of your readers would be interested in the Mahout project [1], an Apache project that aims to implement various machine learning algorithms in the MapReduce paradigm using Hadoop.
[1] Mahout: http://lucene.apache.org/mahout/
Thanks Steve, I've been following the Mahout mailing list for a while and look forward to seeing more machine learning algorithms implemented in mapreduce. It is a young project, but I'm interested to see how it develops.
This post was extremely short because I wanted to get the announcement out quickly. I'll have more to post about machine learning with Hadoop soon. Until then, check out http://delicious.com/pskomoroch/hadoop+machinel... and http://delicious.com/pskomoroch/mapreduce+machi...
Here's an idea for a streaming hack I mentioned on the forums. There is no built-in support for pulling data from EBS volumes instead of S3, and boto isn't installed, but you could use the distributed cache to install boto, mount an EBS public dataset from within the first mapper, and load it onto HDFS. Something like this:
1) Send a zipped boto src directory, renamed with a .mod extension, over the distributed cache.
2) Add an initial dummy JobFlow step where the mapper opens boto.mod, reads in your credentials, imports boto, and mounts the EBS snapshot using your AWS credentials.
Mount the public dataset via boto:
    ec2-attach-volume vol-e84aae81 -i i-a09104c9 -d /dev/sdf
    mkdir /mnt/genbank
    mount /dev/sdf /mnt/genbank
3) After the volume is mounted, shell out within the mapper and copy the data from the EBS volume to HDFS:
    hadoop fs -put /mnt/genbank /home/hadoop/genbank
4) Proceed with the next JobFlow step, using hdfs:///home/hadoop/genbank as input.
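For anyone who wants to try this, here is a rough, untested-on-EMR sketch of what the dummy mapper from steps 1 to 3 might look like in Python. It simply shells out to the same commands listed above rather than calling boto directly. The function names are made up for illustration, and it assumes the EC2 command-line tools and credentials are already available on the node, that the mapper is allowed to mount devices, and that the boto.mod archive has the boto package at its top level. It defaults to a dry run that only reports the commands.

```python
import subprocess
import sys


def enable_zipped_package(archive_path):
    """Put a zip archive on sys.path so the packages inside it can be imported.

    Python's zip import support looks at the file contents, not the file
    name, so an archive renamed to boto.mod works like any other zip.
    """
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)


def build_commands(volume_id, instance_id, device, mount_point, hdfs_dest):
    """Return the shell commands from steps 2 and 3 as argv lists."""
    return [
        ["ec2-attach-volume", volume_id, "-i", instance_id, "-d", device],
        # -p is an addition here so that a retried task does not fail
        # when the mount point already exists.
        ["mkdir", "-p", mount_point],
        ["mount", device, mount_point],
        ["hadoop", "fs", "-put", mount_point, hdfs_dest],
    ]


def run_commands(commands, dry_run=True):
    """Return the commands as strings; execute them only if dry_run is False.

    Execution stops at the first command that exits with a non-zero status,
    because subprocess.check_call raises CalledProcessError.
    """
    rendered = [" ".join(cmd) for cmd in commands]
    if not dry_run:
        for cmd in commands:
            subprocess.check_call(cmd)
    return rendered


if __name__ == "__main__":
    # boto.mod is the renamed archive from step 1; "import boto" would
    # follow this call if the archive is present in the working directory.
    enable_zipped_package("boto.mod")
    cmds = build_commands(
        "vol-e84aae81", "i-a09104c9", "/dev/sdf",
        "/mnt/genbank", "/home/hadoop/genbank",
    )
    # Hadoop streaming treats stdout as mapper output, so log to stderr.
    for line in run_commands(cmds, dry_run=True):
        sys.stderr.write(line + "\n")
```

Switching to `dry_run=False` would actually run the commands, so it is worth checking the dry-run output on a single node first.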
Peter,
Great stuff - thanks very much for posting this.
We have input files ranging from 100 GB to multiple TB that are required for various non-Java analysis applications. Would you recommend the same approach so beautifully presented above and in the Python tutorial, or a different approach to making the input data available?
Peter
Peter T,
Good question. I think this is something Amazon needs to sort out in general. For Elastic MapReduce (EMR) it is easier to work with chunked S3 files at the moment, so I just run a script to copy the raw data up to S3 in chunks using s3cmd. Outside of EMR I usually use single EBS volumes for datasets < 1 TB, since they can be mounted directly on EC2 and are easier to work with. I haven't tried the EBS + EMR hack yet, but will let you know how it goes if I do.
Contact me with the exact data sizes you are dealing with and maybe I can suggest an approach...
-Pete
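The chunk-and-upload script mentioned above isn't shown in the thread, so here is one way it could look. This is only a sketch under assumptions: the chunk naming scheme, the bucket name, and the key prefix are all hypothetical, and the split is done on line boundaries so that no record is cut in half between chunks. It builds `s3cmd put` command lines without running them.

```python
import os
import tempfile


def split_file(path, out_dir, chunk_bytes):
    """Split a file into numbered chunks of roughly chunk_bytes each.

    A new chunk is started only between lines, so chunks can run slightly
    over chunk_bytes but never break a line in two.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(path)
    chunk_paths = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            if out is None or written >= chunk_bytes:
                if out is not None:
                    out.close()
                name = "%s.part-%05d" % (base, len(chunk_paths))
                chunk_path = os.path.join(out_dir, name)
                out = open(chunk_path, "wb")
                chunk_paths.append(chunk_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths


def s3cmd_put_command(chunk_path, bucket, prefix):
    """Build the argv list for uploading one chunk with s3cmd."""
    name = os.path.basename(chunk_path)
    prefix = prefix.strip("/")
    key = "%s/%s" % (prefix, name) if prefix else name
    return ["s3cmd", "put", chunk_path, "s3://%s/%s" % (bucket, key)]


if __name__ == "__main__":
    # Tiny demonstration on a throwaway file; my-bucket and input/ are
    # placeholder names, not real resources.
    workdir = tempfile.mkdtemp()
    sample = os.path.join(workdir, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"record one\nrecord two\nrecord three\n")
    for chunk in split_file(sample, os.path.join(workdir, "chunks"), 16):
        print(" ".join(s3cmd_put_command(chunk, "my-bucket", "input/")))
```

Each printed command can be passed to `subprocess.check_call` once the chunk sizes look right; EMR can then be pointed at the S3 prefix as its input.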