txAWS

Support for Amazon's MapReduce in the Cloud

Registered by Duncan McGreggor on 2009-11-17

txAWS aims to provide a full cloud API for developers who need the benefits of asynchronous programming in their applications and scripts. Supporting MapReduce (via Amazon Elastic MapReduce, Amazon's hosted Hadoop service) in txAWS is part of this effort.

Here are the basic steps as outlined by Amazon (edited extensively):

* Develop your data processing application. Amazon Elastic MapReduce runs your application as a job flow. There is a Python sample application called "similarity" that is a good place to see the workflow involved in using Amazon's MapReduce. Here's the URL:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2274&categoryID=263

* Upload your data and your processing application into Amazon S3. Amazon S3 provides reliable, scalable, easy-to-use storage for your input and output data.

* Start an Amazon Elastic MapReduce “job flow” (using the txAWS API). Choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and start the flow.

* Monitor the progress of your job flow(s) from the txAWS API. After the job flow is done, retrieve the output from Amazon S3.
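
The last two steps (start a job flow, then monitor it until it finishes) can be sketched roughly as follows. This is only an illustration of the intended workflow, not an existing txAWS API: txAWS is built on Twisted, but the sketch uses the standard library's asyncio as a stand-in so that it runs without extra dependencies. The names build_job_flow, FakeEMRClient, run_job_flow, describe_job_flow and run_and_wait are all hypothetical, as are the bucket paths, the job flow ID and the instance settings. The state names are assumed to match Amazon's job flow states.

```python
import asyncio

# States in which a job flow has stopped for good. The names are assumed to
# follow Amazon Elastic MapReduce's job flow states.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def build_job_flow(name, input_uri, output_uri, mapper_uri, reducer_uri,
                   instance_type="m1.small", instance_count=2):
    """Describe a job flow: the EC2 instances to use plus the S3 locations
    of the input data, the output, and the application code.
    The dictionary layout is hypothetical, not a real request format."""
    return {
        "name": name,
        "instance_type": instance_type,
        "instance_count": instance_count,
        "input": input_uri,
        "output": output_uri,
        "mapper": mapper_uri,
        "reducer": reducer_uri,
    }


class FakeEMRClient:
    """Hypothetical in-memory stand-in for an EMR client (not a txAWS class).
    It replays a scripted list of job flow states instead of calling AWS."""

    def __init__(self, states):
        self._states = list(states)
        self.started = []

    async def run_job_flow(self, job_flow):
        # A real client would send the request to Amazon here.
        self.started.append(job_flow)
        return "j-EXAMPLE"  # made-up job flow ID

    async def describe_job_flow(self, job_flow_id):
        # Return the next scripted state; repeat the last one once exhausted.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


async def run_and_wait(client, job_flow, poll_interval=0.0):
    """Start the job flow, then poll its state until it reaches a terminal
    state. Returns the job flow ID and the list of states observed."""
    job_flow_id = await client.run_job_flow(job_flow)
    seen = []
    while True:
        state = await client.describe_job_flow(job_flow_id)
        seen.append(state)
        if state in TERMINAL_STATES:
            return job_flow_id, seen
        # Yield to the event loop between polls so other work can proceed.
        await asyncio.sleep(poll_interval)


if __name__ == "__main__":
    flow = build_job_flow(
        "similarity",
        "s3://example-bucket/input",
        "s3://example-bucket/output",
        "s3://example-bucket/code/mapper.py",
        "s3://example-bucket/code/reducer.py",
    )
    client = FakeEMRClient(["STARTING", "RUNNING", "COMPLETED"])
    print(asyncio.run(run_and_wait(client, flow)))
```

Because the polling loop never blocks, an application could watch several job flows at once while doing other work, which is the benefit txAWS is after. Once the job flow reports a terminal state, the output would be retrieved from the S3 location given in the job flow description.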