Hadoop Study Notes

Latest recommended article published 2024-07-11 16:44:30


Licensed under CC 4.0 BY-SA


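This article walks through setting up Hadoop in standalone mode and in pseudo-distributed mode, covering environment preparation, installation and configuration, and verification.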

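I have been learning Hadoop recently and ran into quite a few problems. Fortunately I solved them one by one, and I am writing them down here for future reference.

1. The most basic step is setting up a Hadoop environment. Because of limited resources, I have only tried standalone mode and pseudo-distributed mode so far.

First I installed Oracle VirtualBox on my Windows 7 machine, installed Ubuntu Kylin 14.04 in it, and then installed Hadoop 2.6.0.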

Standalone mode

Standalone mode is fairly simple: install Java and OpenSSH-Server, edit hadoop-env.sh, and Hadoop can run on a single machine.

1. Add a hadoop user to the system

Before installing, create a dedicated system user named hadoop for the Hadoop experiments.

 ~$ sudo addgroup hadoop
 ~$ sudo adduser --ingroup hadoop hadoop

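So far we have only created the user hadoop, which has no administrator privileges, so we add it to the administrators' group. On Ubuntu 12.04 and later that group is sudo (older releases called it admin):

 ~$ sudo usermod -aG sudo hadoop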
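It is easy to check that the group change took effect before going further. The following is a minimal sketch, not part of the original steps: `has_group` is a hypothetical helper that tests whether a space-separated group list, in the format printed by `id -nG USER`, contains a given group.

```shell
# Sketch: has_group is a hypothetical helper, not a system command.
# It succeeds when the space-separated GROUP_LIST contains WANTED exactly.
has_group() {
    local group_list="$1" wanted="$2"
    printf '%s\n' "$group_list" | tr ' ' '\n' | grep -qxF -- "$wanted"
}

# Typical use once the hadoop user exists (replace sudo with the group
# you actually added the user to):
#   has_group "$(id -nG hadoop)" sudo && echo "hadoop is an administrator"
```

Group changes apply to new login sessions, so log out and back in as hadoop before relying on the result.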

2. Install ssh

Hadoop uses ssh to communicate, so install ssh first:

 ~$ sudo apt-get install openssh-server

Once ssh is installed, start the service:

 ~$ sudo /etc/init.d/ssh start

After starting it, check that the service is running (an sshd process should be listed):

 ~$ ps -e | grep ssh

SSH is a secure protocol and normally asks for a password at login, so we set up passwordless login instead. Generate a private/public key pair:

 hadoop@scgm-ProBook:~$ ssh-keygen -t rsa -P ""

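Because I already had a private key, I was asked whether to overwrite it. On a first run you are only asked where to save the key (the passphrase prompt is skipped because of -P ""); press Enter to accept the default. This creates two files under /home/{username}/.ssh: id_rsa, the private key, and id_rsa.pub, the public key. Now append the public key to authorized_keys, which holds the public keys of every user allowed to log in as the current user over ssh:

 ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys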

Now log in over ssh to confirm that no password is required. The first login may ask you to confirm the host key:

 ~$ ssh localhost

Log out:

 ~$ exit

Log in a second time:

 ~$ ssh localhost

Log out:

 ~$ exit

From now on, logging in no longer requires a password.
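Running the `cat ... >> authorized_keys` step more than once leaves duplicate entries, and sshd can refuse key login when the permissions on these files are too open. The sketch below is one way to make the step repeatable; `authorize_key` is a hypothetical helper name, and the 700/600 modes are the ones commonly recommended for sshd's default StrictModes setting.

```shell
# Sketch: authorize_key is a hypothetical helper, not part of OpenSSH.
# Appends PUBKEY_FILE to SSH_DIR/authorized_keys only if that exact line
# is not already there, then tightens permissions.
authorize_key() {
    local pubkey_file="$1" ssh_dir="$2"
    local auth_file="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir"
    touch "$auth_file"
    if ! grep -qxF -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Usage:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh
# Non-interactive check (BatchMode makes ssh fail instead of asking for a password):
#   ssh -o BatchMode=yes localhost true && echo "passwordless login works"
```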

3. Install Java

 ~$ sudo apt-get install openjdk-6-jdk
 ~$ java -version

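4. Install hadoop

Download the Hadoop release from the official site (the binary tarball is the one to run; the -src package has to be built first).

Extract it and put it in the directory of your choice. I put it in /usr/local/hadoop and gave the hadoop user ownership:

    ~$ sudo chown -R hadoop:hadoop /usr/local/hadoop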
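The download and extract steps can be sketched as follows. This is an illustration under assumptions: `install_tarball` is a hypothetical helper, and the URL in the comment is the Apache archive location as far as I know, so check it against the official download page before using it.

```shell
# Sketch: install_tarball is a hypothetical helper.
# Extracts TARBALL into DEST, dropping the top-level "hadoop-x.y.z/"
# directory so that DEST/bin, DEST/etc and so on sit directly under DEST.
install_tarball() {
    local tarball="$1" dest="$2"
    mkdir -p "$dest"
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Usage (assumed URL; extracting into /usr/local needs root, so either run
# the tar line with sudo or extract elsewhere and move the result):
#   wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
#   install_tarball hadoop-2.6.0.tar.gz "$HOME/hadoop"
```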
   

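5. Configure hadoop-env.sh (Java install path)

Go to the hadoop directory and open hadoop-env.sh. In Hadoop 2.6.0 it lives under etc/hadoop (Hadoop 1.x kept it under conf). Add the following:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk    # depends on where Java is installed on your machine
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin

Then source the file so that the environment variables take effect:

 ~$ source /usr/local/hadoop/etc/hadoop/hadoop-env.sh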
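Since the right JAVA_HOME depends on the machine, it helps to derive it from the `java` binary that is actually on the PATH. A minimal sketch, assuming the usual JDK layout where the binary sits at `<home>/bin/java` or `<home>/jre/bin/java`; `java_home_from_bin` is a hypothetical helper name.

```shell
# Sketch: java_home_from_bin is a hypothetical helper.
# Given the resolved path of the java binary, strip the trailing
# /bin/java (and a /jre component, if present) to get a JAVA_HOME candidate.
java_home_from_bin() {
    local bin_path="$1"
    local home="${bin_path%/bin/java}"
    home="${home%/jre}"
    printf '%s\n' "$home"
}

# Usage (readlink -f follows the /etc/alternatives symlinks on Ubuntu):
#   java_home_from_bin "$(readlink -f "$(command -v java)")"
```

Compare the printed path with the directories under /usr/lib/jvm before writing it into hadoop-env.sh.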

At this point Hadoop is installed in standalone mode.

Now run WordCount, one of the examples bundled with Hadoop, to get a feel for the MapReduce process:

Create an input folder in the hadoop directory:

 ~$ mkdir input

Copy all of the configuration files (etc/hadoop in Hadoop 2.6.0) into the input folder:

 ~$ cp etc/hadoop/* input

Run the WordCount program and save the results to output. The output directory must not exist beforehand, and the examples jar for Hadoop 2.6.0 is under share/hadoop/mapreduce:

 ~$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output

View the results:

 ~$ cat output/*

You will see every word in the copied configuration files listed with its count.
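To get an intuition for what the job does, the same count can be approximated with an ordinary shell pipeline: splitting into words plays the role of map, `sort` plays the role of the shuffle, and `uniq -c` plays the role of reduce. This is only an analogy, assuming the example splits on whitespace; `local_wordcount` is a hypothetical helper, not part of Hadoop.

```shell
# Sketch: local_wordcount is a hypothetical helper, not a Hadoop command.
# Reads the given files (or stdin) and prints "word<TAB>count",
# sorted by word in byte order.
local_wordcount() {
    cat "$@" \
        | tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Usage, from the hadoop directory:
#   local_wordcount input/* | head
```

On a small input the output should line up with `cat output/*`, which is a quick sanity check on the job.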

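Pseudo-distributed mode

Pseudo-distributed mode requires editing core-site.xml and hdfs-site.xml, plus mapred-site.xml if jobs are to run on YARN.

Note that many guides online still target Hadoop 1.x. Hadoop 2.6.0 changes a few things; for example, the "hadoop dfs -xxx" commands are deprecated in favor of "hdfs dfs -xxx".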

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
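A typo in these files only tends to surface later, when the daemons start, so it is worth reading a value back after editing. The sketch below is a rough illustration: `get_hadoop_property` is a hypothetical helper, and it assumes each `<name>` and `<value>` tag sits on its own line, as in the files above. It is not a real XML parser.

```shell
# Sketch: get_hadoop_property is a hypothetical helper, not a Hadoop command.
# Prints the <value> that follows <name>NAME</name> in FILE.
# Assumes one tag per line, as in the snippets above.
get_hadoop_property() {
    local file="$1" name="$2"
    awk -v want="$name" '
        /<name>/ { gsub(/.*<name>|<\/name>.*/, ""); current = $0 }
        /<value>/ && current == want {
            gsub(/.*<value>|<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Usage, from the hadoop directory:
#   get_hadoop_property etc/hadoop/core-site.xml fs.defaultFS
```

Hadoop itself can report the effective value with `bin/hdfs getconf -confKey fs.defaultFS`, which also takes the built-in defaults into account.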

The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

Format the filesystem:
```
  $ bin/hdfs namenode -format
```
Start NameNode daemon and DataNode daemon:
```
  $ sbin/start-dfs.sh
```
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Browse the web interface for the NameNode; by default it is available at:
- NameNode - http://localhost:50070/
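Besides the web interface, the JDK's `jps` tool lists running Java processes, which gives a quick way to confirm the daemons came up. A minimal sketch, assuming `start-dfs.sh` starts NameNode, DataNode and SecondaryNameNode on a single node; `missing_daemons` is a hypothetical helper that reads `jps` output on stdin.

```shell
# Sketch: missing_daemons is a hypothetical helper, not a Hadoop command.
# Reads `jps` output ("PID ClassName" per line) on stdin and prints the
# name of each expected HDFS daemon that is not listed.
missing_daemons() {
    local jps_output daemon
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode; do
        # Compare the second field exactly so that "SecondaryNameNode"
        # is not mistaken for "NameNode".
        if ! printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Usage:
#   jps | missing_daemons
```

No output means all three are running; if a name is printed, the log files in the log directory mentioned above are the place to look.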

Make the HDFS directories required to execute MapReduce jobs:

  $ bin/hdfs dfs -mkdir /user
  $ bin/hdfs dfs -mkdir /user/<username>

Copy the input files into the distributed filesystem:
```
  $ bin/hdfs dfs -put etc/hadoop input
```

Run some of the examples provided:

  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
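The grep example counts every match of the regular expression `dfs[a-z.]+` in the input and, as far as I can tell, lists the matches by descending count. A local pipeline gives a rough idea of the expected result; `local_grep_count` is a hypothetical helper, and `grep -o` is a GNU/BSD extension rather than strict POSIX.

```shell
# Sketch: local_grep_count is a hypothetical helper, not a Hadoop command.
# Prints "count<TAB>match" for every distinct match of PATTERN in the
# given files (or stdin), most frequent first.
local_grep_count() {
    local pattern="$1"
    shift
    cat "$@" \
        | grep -oE -- "$pattern" \
        | LC_ALL=C sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ printf "%s\t%s\n", $1, $2 }'
}

# Usage, from the hadoop directory (etc/hadoop is what was uploaded as input):
#   local_grep_count 'dfs[a-z.]+' etc/hadoop/*
```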

Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
```
  $ bin/hdfs dfs -get output output
  $ cat output/*
```
or

View the output files on the distributed filesystem:
```
  $ bin/hdfs dfs -cat output/*
```
When you're done, stop the daemons with:
```
  $ sbin/stop-dfs.sh
```