Configuring a Hadoop + PySpark environment
1. Deploy the Hadoop environment
Set up Hadoop in pseudo-distributed mode, with all services running on a single node.
1.1. Install the JDK
The JDK is installed from the pre-compiled binary package; download it from the Oracle download page.
- Download the JDK
$ cd /opt/local/src/
$ curl -o jdk-8u171-linux-x64.tar.gz http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz?AuthParam=1529719173_f230ce3269ab2fccf20e190d77622fe1
- Extract the archive and configure environment variables
### Extract to the target location
$ tar -zxf jdk-8u171-linux-x64.tar.gz -C /opt/local
### Create a symbolic link
$ cd /opt/local/
$ ln -s jdk1.8.0_171 jdk
### Configure environment variables: add the following to the current user's ~/.bashrc
$ tail ~/.bashrc
# Java
export JAVA_HOME=/opt/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
- Reload the environment variables
$ source ~/.bashrc
### Verify that the settings take effect; the Java version info below indicates success
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
1.2. Configure /etc/hosts
### Configure /etc/hosts to map the hostname to its IP address
$ head -n 3 /etc/hosts
# ip --> hostname or domain
192.168.20.10 node
### Verify
$ ping node -c 2
PING node (192.168.20.10) 56(84) bytes of data.
64 bytes from node (192.168.20.10): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from node (192.168.20.10): icmp_seq=2 ttl=64 time=0.040 ms
1.3. Set up passwordless SSH login
- Generate an SSH key
### Generate an SSH key pair
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- Add the public key to the authorized_keys file
### The password is required this one time
$ ssh-copy-id node
### Verify the login; if no password is prompted, it succeeded
$ ssh node
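If the login still asks for a password, the most common cause is overly open permissions on ~/.ssh; a minimal check (standard sshd requirements, not specific to this setup):
### sshd ignores the key if ~/.ssh is not 700 or authorized_keys is not 600
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys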
1.4. Install and configure Hadoop
- Download Hadoop
### Download Hadoop 2.7.6
$ cd /opt/local/src/
$ wget -c http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
- Create the Hadoop data directories
$ mkdir -p /opt/local/hdfs/{namenode,datanode,tmp}
$ tree /opt/local/hdfs/
/opt/local/hdfs/
├── datanode
├── namenode
└── tmp
- Extract the Hadoop archive
### Extract to the target location
$ cd /opt/local/src/
$ tar -zxf hadoop-2.7.6.tar.gz -C /opt/local/
### Create a symbolic link
$ cd /opt/local/
$ ln -s hadoop-2.7.6 hadoop
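As a quick sanity check that the archive unpacked correctly, the version can be printed using the full path (HADOOP_HOME / PATH are not set yet, so the binary is addressed directly; this only requires the JAVA_HOME configured above):
### Should report Hadoop 2.7.6
$ /opt/local/hadoop/bin/hadoop version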
1.5. Configure Hadoop
1.5.1. Configure core-site.xml
$ vim /opt/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/local/hdfs/tmp/</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
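A simple way to confirm the file is being picked up is to read a key back with getconf (again using the full path, since Hadoop is not on PATH yet):
### Should print hdfs://node:9000
$ /opt/local/hadoop/bin/hdfs getconf -confKey fs.defaultFS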
1.5.2. Configure hdfs-site.xml
$ vim /opt/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/local/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/local/hdfs/datanode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
1.5.3. Configure mapred-site.xml
### mapred-site.xml must be created by copying the template and then editing the copy
$ cp /opt/local/hadoop/etc/hadoop/mapred-site.xml.template /opt/local/hadoop/etc/hadoop/mapred-site.xml
$ vim /opt/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/history/done_intermediate</value>
</property>
</configuration>
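Note that mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir are HDFS paths rather than local ones. The JobHistory Server normally creates them on startup, but they can also be created by hand once HDFS has been formatted and started, for example:
### Run only after HDFS is up; creates the job history directories in HDFS
$ /opt/local/hadoop/bin/hdfs dfs -mkdir -p /history/done /history/done_intermediate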
1.5.4. Configure yarn-site.xml
$ vim /opt/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>