I've read a lot of this material and forgotten most of it, so I'm writing it down here to make it easier to recall.
Basics review
Hadoop installation
I installed Hadoop with Homebrew, so there isn't much to say about that; the main work is the Hadoop configuration. A few settings worth noting:
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <!-- fs.default.name is the legacy key; newer releases call it fs.defaultFS -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>
yarn-site.xml
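The original notes leave this file empty; as a hedged sketch, the one setting a pseudo-distributed setup usually needs here is the MapReduce shuffle auxiliary service:
<configuration>
  <property>
    <!-- assumption: the standard shuffle service, so NodeManagers can serve map output to reducers -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>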
Hadoop is split into namenode and datanode roles: put simply, the namenode stores the directory structure and manages the datanodes. After installing Hadoop, the namenode has to be formatted.
$ hadoop namenode -format
Next, start the Hadoop services. There are two combinations:
- dfs + MapReduce => starts the namenode (plus secondary namenode), datanodes, jobtracker, and tasktrackers. Port 50030 serves the jobtracker web UI, port 50070 the namenode's.
- dfs + yarn => starts the YARN resource manager and node managers; the resource manager UI is on port 8088.
$ start-dfs.sh
$ start-mapred.sh
$ start-yarn.sh
$ start-all.sh
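To check which daemons actually came up, the JDK's jps tool lists the running Java processes; what you should see depends on which scripts above were used:
$ jps
// expect NameNode, DataNode, SecondaryNameNode for dfs; plus ResourceManager/NodeManager under yarn, or JobTracker/TaskTracker under MRv1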
HDFS
Concepts:
Concept | Notes |
---|---|
data blocks | $ hadoop fsck / -files -blocks |
namenode | namespace image + edit log; two fault-tolerance mechanisms (metadata backup + secondary namenode) |
datanode | stores blocks as requested and reports its block list to the namenode |
HDFS federation | multiple namenodes, each managing a portion of the namespace |
namenode recovery | import the namespace image -> replay the edit log -> accept block reports from the datanodes |
file system | implementations of org.apache.hadoop.fs.FileSystem |
Basic operations
$ hadoop fs -copyFromLocal xx.txt hdfs://localhost/test/xx.txt
$ hadoop fs -copyFromLocal xx.txt /test/xx.txt // default URI comes from core-site
$ hadoop fs -copyToLocal hdfs://localhost/test/xx.txt xx.txt
$ hadoop fs -put xx.txt /test/xx.txt
$ hadoop fs -mkdir /dir
$ hadoop fs -ls .
$ hadoop fs -ls file:///
Practice: using IntelliJ
The idea is simple: write the main function locally, implement map and reduce, build it into a jar, put the input data into HDFS, and run the jar.
$ hadoop jar rudi-hadoop_main.jar /input/test /output/result
// rudi-hadoop_main.jar here is the locally built jar.
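To inspect the result once the job finishes (path matching the command above; the job writes one part-r-NNNNN file per reducer):
$ hadoop fs -cat /output/result/part-*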
Worth noting: a map task can run on the node that holds its data, but a reduce task needs the entire map output, so shipping map output to reducers can hog cluster bandwidth if it isn't optimized; a combiner can be slotted in between to save resources, as sketched below.
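Since word counting just sums integers, the reduce function is associative and commutative, so the reducer class can double as the combiner; a one-line sketch against the WordCount driver below:
// run WordCountReduce on each map's local output before the shuffle
job.setCombinerClass(WordCountReduce.class);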
Word Count:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
FileSystem:
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    // Teach java.net.URL the hdfs:// scheme; a URLStreamHandlerFactory can
    // only be set once per JVM, hence the static initializer.
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
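Since the concept table above mentions org.apache.hadoop.fs.FileSystem, here is a sketch of reading the same file through that API instead (the class name is just for illustration); it avoids the once-per-JVM URLStreamHandlerFactory limitation:
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Resolve the FileSystem implementation from the URI scheme (hdfs://, file://, ...)
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}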
One thing to watch when building the jar in IntelliJ: don't put the manifest file in the IDE's default location, otherwise the generated jar can end up missing it and won't run.
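For reference, a minimal MANIFEST.MF needs little more than the entry point (the Main-Class here assumes the WordCount job above):
Manifest-Version: 1.0
Main-Class: WordCount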