This whole post is a personal record only; don't treat it as a reference.
Environment setup reference: http://www.ityouknow.com/hadoop/2017/07/24/hadoop-cluster-setup.html
1. Environment Setup
A total of three virtual machines are used for this setup.
1.1 First, install the virtual machines. The physical host runs Windows 10; the VMs run CentOS 7, installed with the minimal install option. After installation you may need to bring the network interface up: edit /etc/sysconfig/network-scripts/ifcfg-xxxx (mine is ifcfg-ens33) and change ONBOOT=no to ONBOOT=yes so the machine can reach the network. For example:
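The relevant lines of ifcfg-ens33 end up looking roughly like this (a sketch; BOOTPROTO=dhcp is an assumption, a static address works just as well):
TYPE=Ethernet
BOOTPROTO=dhcp
NAME=ens33
DEVICE=ens33
ONBOOT=yes
After editing, restart networking with systemctl restart network.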
1.2 After installing all three VMs, set their hostnames to master, slave1, and slave2. Edit /etc/sysconfig/network and add HOSTNAME=master on the master machine; on the other two machines add HOSTNAME=slave1 and HOSTNAME=slave2 respectively.
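Note: on CentOS 7 the HOSTNAME line in /etc/sysconfig/network is largely ignored (systemd reads /etc/hostname instead), so it is safer to also set the name with hostnamectl, e.g. on the master:
hostnamectl set-hostname master
and likewise slave1 / slave2 on the other two machines.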
1.3 Edit /etc/hosts on all three machines and add entries like the following (the exact IPs depend on your own machines):
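For example (the 192.168.1.x addresses are placeholders; use your VMs' actual IPs):
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2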
1.4 Software installation, starting with the JDK. After unpacking the JDK, add the following environment variables (typically to /etc/profile or ~/.bashrc):
export JAVA_HOME=/path/to/your/jdk    # replace with your actual JDK directory
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
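To check that the JDK is picked up (assuming the exports went into /etc/profile):
source /etc/profile
java -version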
1.5 Passwordless SSH login
The idea behind passwordless login:
for machine A to log in to machine B without a password,
first generate a key pair on A:
ssh-keygen -t rsa
Then append A's public key to B's authorized_keys, and that's it.
Here is an example of master logging in to slave1 without a password.
① Log in to master and run ssh-keygen -t rsa; you can just press Enter at every prompt.
② Log in to slave1 and run scp root@master:~/.ssh/id_rsa.pub /root/
③ On slave1, run cat /root/id_rsa.pub >> ~/.ssh/authorized_keys.
(If login still fails, run chmod 600 ~/.ssh/authorized_keys.)
④ On master, test with ssh slave1; if it logs in without asking for a password, it worked.
Then set up passwordless login between all three machines in the same way, including from each machine to itself (e.g. running ssh master on master should log straight in); see the compact alternative below.
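If ssh-copy-id is available (it usually is on CentOS 7), the whole mesh, including each machine to itself, can be set up by running the following on each of the three machines after ssh-keygen:
for host in master slave1 slave2; do ssh-copy-id root@$host; done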
1.6 Hadoop Configuration
After unpacking Hadoop (here under /root/hadoop-2.7.5), add it to the PATH as well:
export HADOOP_HOME=/root/hadoop-2.7.5    # adjust to your Hadoop install directory
export PATH=$PATH:$HADOOP_HOME/bin
Next, edit Hadoop's configuration files, which live in etc/hadoop under the Hadoop installation directory.
First, core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/root/hadoop-2.7.5/tmp</value><!-- change to the tmp directory under your own Hadoop installation -->
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value><!-- "master" here must match the master node's hostname; change it if yours differs -->
  </property>
</configuration>
Then hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value><!-- set according to your actual cluster -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/root/hadoop-2.7.5/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/root/hadoop-2.7.5/hdfs/data</value>
  </property>
</configuration>
Then mapred-site.xml (in Hadoop 2.7.x, create it by copying mapred-site.xml.template if it doesn't exist yet):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>http://master:9001</value><!-- legacy Hadoop 1.x property; not used by YARN, but harmless -->
  </property>
</configuration>
Finally, yarn-site.xml:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
Last, add the two slave hostnames to the slaves file in the same directory:
slave1
slave2
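The slaves need the same Hadoop directory and configuration as the master; assuming the same /root layout on every node, the configured installation can simply be copied over:
scp -r /root/hadoop-2.7.5 root@slave1:/root/
scp -r /root/hadoop-2.7.5 root@slave2:/root/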
1.7 Starting Hadoop
1.7.1 Format the HDFS filesystem
bin/hadoop namenode -format (run from the Hadoop installation directory; formatting is only needed once, before the first start)
1.7.2 Start Hadoop
sbin/start-all.sh
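If everything came up, jps on master should show NameNode, SecondaryNameNode, and ResourceManager, and jps on slave1/slave2 should show DataNode and NodeManager. With the Hadoop 2.x default ports, the HDFS web UI is at http://master:50070 and the YARN UI at http://master:8088.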
1.8 Possible problems
If startup complains "JAVA_HOME is not set and could not be found" even though JAVA_HOME is exported in the shell, hard-code it in etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/path/to/your/jdk    # replace with your actual JDK path
1.9 Word count program
Create a Maven project; the pom.xml is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.edu.bupt.wcy</groupId>
<artifactId>wordcount</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>wordcount</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
</project>
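The project can then be built into a jar with Maven; the artifact lands in target/:
mvn clean package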
mapper:
package cn.edu.bupt.wcy.wordcount;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for each token
        String[] words = StringUtils.split(value.toString(), " ");
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
reducer:
package cn.edu.bupt.wcy.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this word; use long to avoid overflow
        long sum = 0;
        for (LongWritable num : values) {
            sum += num.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
runner:
package cn.edu.bupt.wcy.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCountRunner {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf); // new Job(conf) is deprecated in the 2.x API
        job.setJarByClass(WordCountRunner.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Input and output HDFS paths come from the command line: <input> <output>
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Package the program into a jar and run it on the cluster.
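A sketch of a run (the jar name wordcount.jar and the input file words.txt are assumptions; substitute your own build artifact and data):
hadoop fs -mkdir -p /input
hadoop fs -put words.txt /input
hadoop jar wordcount.jar cn.edu.bupt.wcy.wordcount.WordCountRunner /input /output
hadoop fs -cat /output/part-r-00000
Note that the /output directory must not exist before the job runs; HDFS refuses to overwrite it.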