To get started with Mahout, I installed Hadoop on my PC today. Here is the installation guide, hope it is useful :)
Required Software
1. Java 1.6.x
2. Cygwin: a Linux-like environment for Windows; it is required for shell support in addition to the software above.
3. SSH must be installed and SSHD must be running to use the Hadoop scripts that manage remote Hadoop daemons.
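Before going further, it is worth confirming the Java requirement from a command prompt (the exact version string will differ on your machine):
java -version
If the command is not found, or it reports a version below 1.6, install a Java 1.6.x JDK first and note its install path; you will need it later for JAVA_HOME.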
Install Cygwin:
1. Download setup.exe from http://www.cygwin.com/
2. Select "Install from Internet", and specify the download folder and the install folder. For downloading, please select a nearby mirror.
3. After the component list is downloaded, search for "ssh"; it is in the Net category. Change the default "Skip" to a version of OpenSSH.
4. Download and install the components.
After installation you will see a Cygwin shortcut on your desktop; run it to get a bash shell, which gives you a Linux-like environment on Windows.
Your Linux filesystem lives under your Cygwin install folder (%Your_Cygwin_Install%).
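To make sure the Cygwin environment itself works, open the bash shell and try a couple of commands (the cygpath output assumes a default C:\cygwin install location, so treat it as an example):
uname -a          (should identify the system as CYGWIN_NT-...)
cygpath -w /usr/local          (prints the Windows path behind a Cygwin path, e.g. C:\cygwin\usr\local)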
SSH Configuration:
1. Add system environment variables:
A. Add a new system environment variable named CYGWIN with the value 'ntsec tty'.
B. Edit the PATH environment variable and add your 'Cygwin/bin' folder to it.
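To confirm the variables took effect, open a new Command Prompt window (existing windows will not see the change) and run:
echo %CYGWIN%          (should print: ntsec tty)
ssh -V          (should print the OpenSSH version, which proves Cygwin/bin is on your PATH)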
2. Config SSH
A. Change to the bin folder: "cd /bin"
B. Execute the configuration command "ssh-host-config -y". When the "CYGWIN=" prompt comes up, enter "ntsec tty". After this, an SSH service is registered in Windows services; then please restart your computer.
C. Change to the home folder in your Cygwin install folder; you will see that a folder named after your Windows user account has been generated.
D. Execute the connect command: "ssh yourname@127.0.0.1"
If you connect successfully, your configuration is correct; it will print something like "Last login: Sun Jun 8 19:47:14 2008 from localhost".
If the connection fails, you may need to allow SSH through your firewall; the default SSH port is 22.
E. If you want to connect without typing the password every time, you can set up key-based authentication with the following commands (note the empty passphrase):
"ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa"
"cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys"
After that you will no longer be asked for the password when you connect through SSH (a quick end-to-end check follows below).
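Before moving on to Hadoop, here is a quick check that sshd is registered as a Windows service and that the key-based login works (the service name "sshd" is the default created by ssh-host-config; yours may differ if you changed it):
cygrunsrv -Q sshd          (shows the service state, which should be Running)
net start sshd          (starts the service if it is stopped)
ssh localhost echo ok          (should print "ok" without asking for a password)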
Hadoop Install and Configuration
1. Download the Hadoop ".tar.gz" file and extract it under your Cygwin file system (suggested location: /usr/local).
2. Configure hadoop-env.sh under the hadoop/conf folder:
export JAVA_HOME=<Your Java Location>   # installing Java under the Cygwin tree makes the location easier to specify
export HADOOP_IDENT_STRING=MYHADOOP
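For example, if your JDK were installed under C:\Java\jdk1.6.0 (a made-up path, adjust it to your own), the Cygwin-style setting in conf/hadoop-env.sh would look like this:
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0
export HADOOP_IDENT_STRING=MYHADOOP
Note that Cygwin maps drive C: to /cygdrive/c, and paths containing spaces (such as "Program Files") tend to cause trouble here, which is another reason to keep Java under a space-free path or inside the Cygwin tree.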
After the configuration, you can use the following commands to verify your installation.
cd /usr/local/hadoop
bin/hadoop version
It should print out:
Hadoop 0.17.0
Subversion http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523
Compiled by hadoopqa on Thu May 15 07:22:55 UTC 2008
3. Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
For the "Pseudo-Distributed Operation" mode, you need to make the following configuration changes:
A. in conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
B. in conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
C. in conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
4. Execution
A. Format a new distributed-filesystem:
$ bin/hadoop namenode -format
B. Start the hadoop daemons:
$ bin/start-all.sh
You can also start them individually with bin/start-dfs.sh and bin/start-mapred.sh (a quick way to check that the daemons came up is shown at the end of this guide).
The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to
${HADOOP_HOME}/logs).
C. Browse the web interface for the NameNode and the JobTracker; by default they are available at:
* NameNode - http://localhost:50070/
* JobTracker - http://localhost:50030/
D. Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
E. Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
F. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
G. When you're done, stop the daemons with:
$ bin/stop-all.sh
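If something does not work, two quick ways to see what the daemons are doing (jps ships with the JDK; the log file name follows the HADOOP_IDENT_STRING set earlier, so treat it as a pattern rather than an exact name):
$ jps          (should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker)
$ tail -n 100 logs/hadoop-MYHADOOP-namenode-*.log          (inspect the NameNode log for errors)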