Some Hadoop Beginner Demos

License: CC 4.0 BY-SA

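This post is the author's first attempt at learning Hadoop. It uses a small demo to show how to find the highest temperature of each year in a given file. The author first creates a Maven project and configures the required dependencies, then writes the TempMapper and TempReducer classes to handle the map and reduce tasks. The map phase extracts the year and temperature from each input line; the reduce phase gathers the temperatures of the same year and picks the maximum. The walkthrough highlights the map/reduce structure: the map phase only deals with the data format, while aggregation happens in the reduce phase. Finally, the author notes that the output printed in map can be found in Hadoop's log files.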

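I have recently been studying Hadoop. I am still at the beginner stage and not yet comfortable with the more advanced topics, but I hope that by writing up what I learn bit by bit, I can build up experience day by day and eventually become an expert in Hadoop, Spark, and big data in general.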

So I can only start with small demos. The one introduced here finds the highest temperature of each year in a given file.

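In this data, the first 4 characters of each line are the year, the middle 4 are the month and day, and the last 2 are that day's temperature. The goal is to use Hadoop to pick the highest temperature among all the dates of each year.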
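To make the record layout concrete, here is a minimal plain-Java sketch (no Hadoop required) that splits one line into its year and temperature fields. The class name `RecordParser` and its two helper methods are hypothetical names introduced only for illustration; the sample line is one of the records that appears in the mapper log later in this post. The `trim()` call is an added precaution against trailing whitespace and is not part of the original job.

```java
public class RecordParser {
    // Hypothetical helper: the year is the first 4 characters of the record.
    static String parseYear(String line) {
        return line.substring(0, 4);
    }

    // Hypothetical helper: the temperature is everything after the
    // 8-character date (4-digit year + 4-digit month/day).
    static int parseTemperature(String line) {
        return Integer.parseInt(line.substring(8).trim());
    }

    public static void main(String[] args) {
        String line = "2014010114"; // sample record from the mapper log in this post
        System.out.println(parseYear(line) + " -> " + parseTemperature(line));
    }
}
```

A leading zero such as in `2014010506` is handled by `Integer.parseInt`, which reads `06` as the decimal value 6.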

First, create a new Maven project and update pom.xml with the required jar dependencies.

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.7.1</version>
    </dependency>
</dependencies>

These are the most basic jars a typical Hadoop project depends on. Next comes the main code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by sunwangdong on 2017/3/19.
 */
public class Temperature
{
    public static class TempMapper extends Mapper<LongWritable,Text,Text,IntWritable>
    {
        public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException
        {
            System.out.print("Before Mapper:" + key + "," + value);
            String line = value.toString();
            String year = line.substring(0,4);
            int temperature = Integer.parseInt(line.substring(8));
            context.write(new Text(year),new IntWritable(temperature));
            System.out.println("======" + "After Mapper:" + new Text(year) + ", " + new IntWritable(temperature));
        }
    }
    public static class TempReducer extends Reducer<Text,IntWritable,Text,IntWritable>
    {
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException,InterruptedException
        {
            int maxValue = Integer.MIN_VALUE;
            StringBuffer sb = new StringBuffer();
            for(IntWritable value : values)
            {
                maxValue = Math.max(maxValue,value.get());
                sb.append(value).append(", ");
            }
            System.out.print("Before Reduce: " + key + ", " + sb.toString());
            context.write(key,new IntWritable(maxValue));
            System.out.println("======" + "After Reduce: " +
            key + ", " + maxValue);
        }
    }
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf,"Temp");
        job.setJarByClass(Temperature.class);
        job.setMapperClass(TempMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(TempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job,new Path(args[0]));
        // set the input path
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // set the output path

        boolean b = job.waitForCompletion(true);           // submit the job and wait for it to finish
        if(!b)
        {
            System.err.println("failed");
        }
        else
            System.out.println("finished!");
    }
}

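Note that I built the jar with IDEA's built-in tooling. Running hadoop jar directly on that jar produced an error, so I first had to delete the license file inside the jar with

zip -d ****.jar META-INF/LICENSE

and then run

hadoop jar ****.jar com.***.Temperature <data location> <output location>

The package name must be included here; otherwise Hadoop cannot find the class that defines the job and throws a ClassNotFoundException.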

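The code defines two classes, TempMapper and TempReducer, which implement the map and reduce tasks for finding the highest temperature. In the map task, the year and the temperature are extracted from each input line according to the record layout, using String's substring method, and the resulting pair becomes the input of the reducer. Pay attention to the arguments of context.write: the key and value types must match the mapper's declared output types (Text and IntWritable here). In the reducer, the temperatures that map emitted for the same year, which were scattered across the input, arrive gathered together under a single key, so the reduce function only has to iterate over them and keep the maximum.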

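This example shows that with the map/reduce structure you only need to spell out the input format and what to do with each record. You do not write any code to group the map output: the framework collects the values for each key between the two phases, and reduce only has to iterate over those values and produce the output. Map/reduce code in general follows this pattern. In main, the paths given to FileInputFormat and FileOutputFormat refer to paths on HDFS, and the input file was uploaded there with the hdfs dfs -put command.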
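The division of labor described here can be tried out without a cluster. The following is a rough plain-Java sketch, not the Hadoop implementation: the class `LocalMaxTemperature` and its methods are hypothetical, and a `TreeMap` stands in for the grouping and key sorting that the framework performs between map and reduce. The sample lines are taken from the mapper log in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalMaxTemperature {
    // Imitates map plus the framework's grouping: each line becomes a
    // (year, temperature) pair, and pairs are collected per year.
    static Map<String, List<Integer>> mapAndGroup(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keys kept in sorted order
        for (String line : lines) {
            String year = line.substring(0, 4);
            int temperature = Integer.parseInt(line.substring(8).trim());
            grouped.computeIfAbsent(year, k -> new ArrayList<>()).add(temperature);
        }
        return grouped;
    }

    // Imitates reduce: iterate over one year's values and keep the maximum.
    static int reduceMax(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2014010114", "2014010216", "2014010317",
                "2012010609", "2012010732");
        for (Map.Entry<String, List<Integer>> e : mapAndGroup(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduceMax(e.getValue()));
        }
    }
}
```

Only `reduceMax` contains aggregation logic; the grouping step is generic and is exactly the part that a real job leaves to the framework.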

The final result is as follows: [output screenshot not preserved]

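The map code above contains some print statements. Their output can be found in the log files, under the userlogs folder inside the logs folder of the Hadoop installation directory. You can look up the ID of the specific job run at http://localhost:8088 and then use that ID to locate the log output; the stdout file there holds whatever System.out printed.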

For example, the intermediate output of the run above is:

Before Mapper:0,2014010114======After Mapper:2014, 14
Before Mapper:11,2014010216======After Mapper:2014, 16
Before Mapper:22,2014010317======After Mapper:2014, 17
Before Mapper:33,2014010410======After Mapper:2014, 10
Before Mapper:44,2014010506======After Mapper:2014, 6
Before Mapper:55,2012010609======After Mapper:2012, 9
Before Mapper:66,2012010732======After Mapper:2012, 32
Before Mapper:77,2012010812======After Mapper:2012, 12
Before Mapper:88,2012010919======After Mapper:2012, 19
Before Mapper:99,2012011023======After Mapper:2012, 23
Before Mapper:110,2001010116======After Mapper:2001, 16
Before Mapper:121,2001010212======After Mapper:2001, 12
Before Mapper:132,2001010310======After Mapper:2001, 10
Before Mapper:143,2001010411======After Mapper:2001, 11
Before Mapper:154,2001010529======After Mapper:2001, 29
Before Mapper:165,2013010619======After Mapper:2013, 19
Before Mapper:176,2013010722======After Mapper:2013, 22
Before Mapper:187,2013010812======After Mapper:2013, 12
Before Mapper:198,2013010929======After Mapper:2013, 29
Before Mapper:209,2013011023======After Mapper:2013, 23
Before Mapper:220,2008010105======After Mapper:2008, 5
Before Mapper:231,2008010216======After Mapper:2008, 16
Before Mapper:242,2008010337======After Mapper:2008, 37
Before Mapper:253,2008010414======After Mapper:2008, 14
Before Mapper:264,2008010516======After Mapper:2008, 16
Before Mapper:275,2007010619======After Mapper:2007, 19
Before Mapper:286,2007010712======After Mapper:2007, 12
Before Mapper:297,2007010812======After Mapper:2007, 12
Before Mapper:308,2007010999======After Mapper:2007, 99
Before Mapper:319,2007011023======After Mapper:2007, 23
Before Mapper:330,2010010114======After Mapper:2010, 14
Before Mapper:341,2010010216======After Mapper:2010, 16
Before Mapper:352,2010010317======After Mapper:2010, 17
Before Mapper:363,2010010410======After Mapper:2010, 10
Before Mapper:374,2010010506======After Mapper:2010, 6
Before Mapper:385,2015010649======After Mapper:2015, 49
Before Mapper:396,2015010722======After Mapper:2015, 22
Before Mapper:407,2015010812======After Mapper:2015, 12
Before Mapper:418,2015010999======After Mapper:2015, 99
Before Mapper:429,2015011023======After Mapper:2015, 23

This is what the System.out calls in the mapper function printed.
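The first number on each "Before Mapper" line is the map input key. The job does not set an input format, so Hadoop's default text input format applies, and with it the key is the byte offset at which the line starts in the file. That matches the step of 11 seen here: 10 characters per record plus one newline byte. The sketch below recomputes those offsets in plain Java; the class `LineOffsets` is hypothetical, and it assumes single-byte `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineOffsets {
    // Returns the starting byte offset of each line, assuming every line
    // is terminated by a single '\n' byte (an assumption of this sketch).
    static List<Long> offsets(List<String> lines) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            result.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return result;
    }

    public static void main(String[] args) {
        // First three records from the mapper log in this post.
        System.out.println(offsets(Arrays.asList("2014010114", "2014010216", "2014010317")));
        // prints [0, 11, 22]
    }
}
```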