In this excerpt from the O'Reilly publication Hadoop: The Definitive Guide, Second Edition, we look at running Hadoop on Amazon EC2, which is a great way to try out your own Hadoop cluster on a low-commitment, trial basis.
Although many organizations choose to run Hadoop in-house, it is also popular to run Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera offers tools for running Hadoop in a public or private cloud, and Amazon has a Hadoop cloud service called Elastic MapReduce.
Hadoop on Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers to rent computers (instances) on which they can run their own applications. A customer can launch and terminate instances on demand, paying by the hour for active instances.
The Apache Whirr project (http://incubator.apache.org/whirr) provides a set of scripts that make it easy to run Hadoop on EC2 and other cloud providers.[77] The scripts allow you to perform such operations as launching or terminating a cluster, or adding instances to an existing cluster.
[77] There are also bash scripts in the src/contrib/ec2 subdirectory of the Hadoop distribution, but these are deprecated in favor of Whirr. In this section, we use Whirr’s Python scripts (found in contrib/python), but note that Whirr also has Java libraries that implement similar functionality.
Running Hadoop on EC2 is especially appropriate for certain workflows. For example, if you store data on Amazon S3, then you can run a cluster on EC2 and run MapReduce jobs that read the S3 data and write output back to S3, before shutting down the cluster. If you’re working with longer-lived clusters, you might copy S3 data onto HDFS running on EC2 for more efficient processing, as HDFS can take advantage of data locality, but S3 cannot (since S3 storage is not collocated with EC2 nodes).
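As a sketch, a transient-cluster workflow of this kind might look like the following, using the hadoop-ec2 and hadoop commands described in the rest of this section (the bucket name mybucket, the input and output paths, and the job class MyJob are placeholders, and HADOOP_CONF_DIR must be set as shown under "Running a MapReduce job"):
% hadoop-ec2 launch-cluster test-hadoop-cluster 5    # start a one-master, five-worker cluster
% eval `hadoop-ec2 proxy test-hadoop-cluster`        # proxy client traffic through the master
% hadoop distcp s3n://mybucket/input input           # copy input data from S3 into HDFS
% hadoop jar job.jar MyJob input output              # run the MapReduce job
% hadoop distcp output s3n://mybucket/output         # copy the results back to S3
% hadoop-ec2 terminate-cluster test-hadoop-cluster   # shut the cluster down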
Setup
Before you can run Hadoop on EC2, you need to work through Amazon’s Getting Started Guide (linked from the EC2 website, http://aws.amazon.com/ec2/), which goes through setting up an account, installing the EC2 command-line tools, and launching an instance.
Next, install Whirr, then configure the scripts to set your Amazon Web Service credentials, security key details, and the type and size of server instances to use. Detailed instructions for doing this may be found in Whirr’s README file.
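As a rough illustration, assuming the scripts pick up the standard AWS credential environment variables (check the README for the exact mechanism, and for how to specify your SSH key pair and the instance type and size), the credential part of the setup might look like this:
% export AWS_ACCESS_KEY_ID=...       # your AWS access key ID (placeholder)
% export AWS_SECRET_ACCESS_KEY=...   # your AWS secret access key (placeholder)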
Launching a cluster
We are now ready to launch a cluster. To launch a cluster named test-hadoop-cluster with one master node (running the namenode and jobtracker) and five worker nodes (running the datanodes and tasktrackers), type:
% hadoop-ec2 launch-cluster test-hadoop-cluster 5
This will create EC2 security groups for the cluster, if they don’t already exist, and give the master and worker nodes unfettered access to one another. It will also enable SSH access from anywhere. Once the security groups have been set up, the master instance will be launched; then, once it has started, the five worker instances will be launched. The worker nodes are launched separately so that the master’s hostname can be passed to the worker instances, allowing the datanodes and tasktrackers to connect to the master when they start up.
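If you want to see what was created, the EC2 command-line tools installed during setup can list the new instances and security groups (the exact group names for your cluster may differ):
% ec2-describe-instances                          # shows the master and worker instances
% ec2-describe-group | grep test-hadoop-cluster   # shows the cluster's security groups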
To use the cluster, network traffic from the client needs to be proxied through the master node of the cluster using an SSH tunnel, which we can set up using the following command:
% eval `hadoop-ec2 proxy test-hadoop-cluster`
Proxy pid 27134
Running a MapReduce job
You can run MapReduce jobs either from within the cluster or from an external machine. Here we show how to run a job from the machine we launched the cluster on. Note that this requires the same version of Hadoop to be installed locally as is running on the cluster.
When we launched the cluster, a hadoop-site.xml file was created in the directory ~/.hadoop-cloud/test-hadoop-cluster. We can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable as follows:
% export HADOOP_CONF_DIR=~/.hadoop-cloud/test-hadoop-cluster
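As a quick sanity check that the client is now talking to the cluster’s filesystem (and that the proxy is working), you can list the root of HDFS:
% hadoop fs -ls /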
The cluster’s filesystem is empty, so before we run a job, we need to populate it with data. Doing a parallel copy from S3 using Hadoop’s distcp tool is an efficient way to transfer data into HDFS:
% hadoop distcp s3n://hadoopbook/ncdc/all input/ncdc/all
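When the copy completes, you can check how much data arrived, for example by asking HDFS for the size of the target directory:
% hadoop fs -du input/ncdc/all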
After the data has been copied, we can run a job in the usual way:
% hadoop jar job.jar MaxTemperatureWithCombiner input/ncdc/all output
Alternatively, we could have specified the input to be S3, which would have the same effect. When running multiple jobs over the same input data, it’s best to copy the data to HDFS first to save bandwidth:
% hadoop jar job.jar MaxTemperatureWithCombiner s3n://hadoopbook/ncdc/all output
You can track the progress of the job using the jobtracker’s web UI, found at http://master_host:50030/. To access web pages running on worker nodes, you need to set up a proxy auto-config (PAC) file in your browser. See the Whirr documentation for details on how to do this.
Terminating a cluster
To shut down the cluster, issue the terminate-cluster command:
% hadoop-ec2 terminate-cluster test-hadoop-cluster
You will be asked to confirm that you want to terminate all the instances in the cluster.
Finally, stop the proxy process (the HADOOP_CLOUD_PROXY_PID environment variable was set when we started the proxy):
% kill $HADOOP_CLOUD_PROXY_PID
Learn more about this topic from Hadoop: The Definitive Guide, Second Edition.
Apache Hadoop is ideal for organizations with a growing need to store and process massive application datasets. With Hadoop: The Definitive Guide, programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters. The book includes case studies that illustrate how Hadoop is used to solve specific problems.