Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
hduser@ubuntu:/usr/local/hadoop$
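If you want to double-check what the format step created, you can list the new storage directory. The path below assumes the default hadoop.tmp.dir of /tmp/hadoop-<username>; adjust it if you configured a different location:

hduser@ubuntu:~$ ls -l /tmp/hadoop-hduser/dfs/name/current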
Starting your single-node cluster
Run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a NameNode, a DataNode, a SecondaryNameNode, a JobTracker, and a TaskTracker on your machine.
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.
hduser@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
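If one of the daemons is missing, a quick way to spot it is to filter the jps output for the five expected process names (a small sketch using standard tools):

hduser@ubuntu:~$ jps | egrep 'NameNode|SecondaryNameNode|DataNode|JobTracker|TaskTracker'

Fewer than five matching lines means a daemon did not come up; the log files (see below) will usually tell you why.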
You can also check with netstat whether Hadoop is listening on the configured ports.
hduser@ubuntu:~$ sudo netstat -plten | grep java
tcp 0 0 0.0.0.0:50070   0.0.0.0:* LISTEN 1001 9236 2471/java
tcp 0 0 0.0.0.0:50010   0.0.0.0:* LISTEN 1001 9998 2628/java
tcp 0 0 0.0.0.0:48159   0.0.0.0:* LISTEN 1001 8496 2628/java
tcp 0 0 0.0.0.0:53121   0.0.0.0:* LISTEN 1001 9228 2857/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
tcp 0 0 0.0.0.0:59305   0.0.0.0:* LISTEN 1001 8141 2471/java
tcp 0 0 0.0.0.0:50060   0.0.0.0:* LISTEN 1001 9857 3005/java
tcp 0 0 0.0.0.0:49900   0.0.0.0:* LISTEN 1001 9037 2785/java
tcp 0 0 0.0.0.0:50030   0.0.0.0:* LISTEN 1001 9773 2857/java
hduser@ubuntu:~$
If there are any errors, examine the log files in the logs/ directory (/usr/local/hadoop/logs/ in this setup).
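For example, to follow the NameNode’s log while troubleshooting, you can tail it; the file name below is inferred from the startup messages above (the .log file next to each .out file carries the full log output) and depends on your user and host names:

hduser@ubuntu:~$ tail -f /usr/local/hadoop/logs/hadoop-hduser-namenode-ubuntu.log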
Stopping your single-node cluster
Run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.
Example output:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$
Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available at the Hadoop Wiki.
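To make the input/output contract concrete, here is a tiny, purely local illustration using standard Unix tools (a sketch only, not Hadoop; the real job’s Java tokenizer splits slightly differently):

# Count word occurrences in a sample sentence; the output format
# (word, a tab, then the count) mirrors what WordCount produces.
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c | awk '{print $2 "\t" $1}'

The MapReduce job distributes exactly this kind of counting across map tasks (tokenizing) and reduce tasks (summing).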
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (pg20417.txt)
- Ulysses by James Joyce (pg4300.txt)
- The Notebooks of Leonardo da Vinci (pg5000.txt)
Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
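One way to fetch them is from the command line with wget. The URLs below follow Project Gutenberg’s pg<id>.txt naming scheme and are an assumption on my part, so verify them against the download links on the book pages:

hduser@ubuntu:~$ mkdir -p /tmp/gutenberg && cd /tmp/gutenberg
hduser@ubuntu:/tmp/gutenberg$ wget https://www.gutenberg.org/cache/epub/20417/pg20417.txt
hduser@ubuntu:/tmp/gutenberg$ wget https://www.gutenberg.org/cache/epub/4300/pg4300.txt
hduser@ubuntu:/tmp/gutenberg$ wget https://www.gutenberg.org/cache/epub/5000/pg5000.txt

After downloading, the directory should look like this: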
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
hduser@ubuntu:~$
Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup  674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$
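As a quick sanity check, you can compare the byte counts in HDFS with the local files, for example with the dfs shell’s du command (a small sketch; the sizes should match the ls -l output above):

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -du /user/hduser/gutenberg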
Run the MapReduce job
Now, we actually run the WordCount example job.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.
Note: Some people run the command above and get the following errormessage:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
	at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Example output of the previous command in the console:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters
10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286
Check if the result is successfully stored in the HDFS directory /user/hduser/gutenberg-output:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
-rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
hduser@ubuntu:/usr/local/hadoop$
If you want to modify some Hadoop settings on the fly, like increasing the number of Reduce tasks, you can use the "-D" option:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user-specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local filesystem. Alternatively, you can use the command
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system, though.
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra"	1
"1490	1
"1498,"	1
"35"	1
"40,"	1
"A	2
"AS-IS".	1
"A_	1
"Absoluti	1
"Alack!	1
hduser@ubuntu:/usr/local/hadoop$
Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
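If you do want it sorted, for example with the most frequent words first, you can post-process the merged file locally (a small sketch; the count is the tab-separated second column):

hduser@ubuntu:/usr/local/hadoop$ sort -t$'\t' -k2 -rn /tmp/gutenberg-output/gutenberg-output | head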
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
- NameNode web UI: http://localhost:50070/
- JobTracker web UI: http://localhost:50030/
- TaskTracker web UI: http://localhost:50060/
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
NameNode Web Interface (HDFS layer)
The NameNode web UI shows you a cluster summary including information about total/remaining capacity, and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
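On a headless machine you can still verify that the interface is up, for example by requesting the page headers with curl (a quick sketch):

hduser@ubuntu:~$ curl -sI http://localhost:50070/ | head -n 1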
JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine on which the web UI is running).
By default, it’s available at http://localhost:50030/.
TaskTracker Web Interface (MapReduce layer)
The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50060/.
What’s next?
If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster), where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100%, heh).
In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language, which can serve as the basis for writing your own MapReduce programs.
Related Links
From yours truly:
From other people:
Change Log
Only important changes to this article are listed here:
- 2011-07-17: Renamed the Hadoop user from hadoop to hduser based on readers’ feedback. This should make the distinction between the local Hadoop user (now hduser), the local Hadoop group (hadoop), and the Hadoop CLI tool (hadoop) more clear.