hadoop——如何建立一个MapReduce模板；自定义Reduce输出数据类型；设计MapReduce程序；网站日志分析指标

本文链接：https://blog.youkuaiyun.com/h1025372645/article/details/94589315

本文详细介绍MapReduce程序的设计与实现过程，包括Driver类的三种实现方式，Mapper和Reducer类的模板代码，以及Hadoop数据类型和自定义数据类型的创建方法。同时，文章还讲解了数据ETL过程在MapReduce中的应用，以及如何通过Map和Reduce阶段进行数据处理。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

MapReduce运行主类——Driver类：

实现方法：

（1）不继承也不实现：

即运行主类不继承和实现任何类与接口

（2）继承和实现：官方推荐

extends Configured implements Tool

（3）不继承只实现接口

implements Tool

企业用的比较多，一些Configured都需要自定义，一些配置属性的值，可以通过configured去配置

模板代码——MRDriver：

import com.huadian.bigdata.mapreduce.WordCountMapReduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MRDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        //1.创建Job
        Job job = Job.getInstance( this.getConf(), "XXXX" );
        job.setJarByClass( MRDriver.class );
        //2.设置job
        //（2.1）input
        Path inputPath  = new Path( args[0] );
        FileInputFormat.setInputPaths( job,inputPath );
        //（2.2）map
        job.setMapperClass( null );
        job.setMapOutputKeyClass( null );
        job.setMapOutputValueClass( null );
        //（2.3）shuffle
        job.setPartitionerClass( null );//分区
        job.setGroupingComparatorClass( null );//分组
        job.setSortComparatorClass( null );//排序

        //（2.4）reduce
        job.setReducerClass( null );
        job.setOutputKeyClass(  null );
        job.setOutputValueClass( null );
        //job.setNumReduceTasks( 2 );
        //（2.5）output
        //如果输出目录存在，干掉
        FileSystem hdfs = FileSystem.get( this.getConf() );
        Path outputPath  = new Path( args[1] );
        if(hdfs.exists( outputPath )){
            hdfs.delete(  outputPath ,true);
        }
        FileOutputFormat.setOutputPath( job,outputPath );

        //3.提交运行
        boolean isSuccess = job.waitForCompletion( true );
        return isSuccess?0:1;
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        //run(Configuration conf, Tool tool, String[] args)
        try {
            int status = ToolRunner.run( configuration, new MRDriver(), args );
            System.exit( status );
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

模板代码——MRMapper：



import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MRMapper extends Mapper<LongWritable,Text,Text,Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //todo:实现对应业务
    }
}

模板代码——MRReducer：



import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MRReducer extends Reducer<Text,Text,Text,Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        //todo:实现对应业务
    }
}

hadoop——数据类型

Hadoop的数据类型和我们Java将数据类型是一样的，只不过Hadoop重新封装了一些类型

这些类型和Java的包装类非常非常相似。

数据类型的转换

Hadoop类型 -> get方法 Java类型的转换

Hadoop类型 <- set方法 Java类型的转换

自定义数据类型：可修改输出格式

（1）创建类实现WritableComparable或者Writable

（2）根据需要定义属性，生成get/set

（3）构造：空参，带参数

（4）序列化和反序列方法实现

（5）比较方法compareTo

（6）toString方法的实现

具体实现：

package com.user_test_writble;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class UserTestWritable  implements WritableComparable<UserTestWritable> {
    private String firstKey;
    private int    secondKey;

    public UserTestWritable(String firstKey, int secondKey) {
        this.firstKey = firstKey;
        this.secondKey = secondKey;
    }

    public UserTestWritable() {
    }

    @Override
    public int compareTo(UserTestWritable o) {
        int i = this.getFirstKey().compareTo(o.getFirstKey());
        if(i==0){
            return Integer.parse(this.getSecondKey).equals(Integer.parse(o.getSecondKey));           
        }
        return i;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(firstKey);
        out.writeInt(secondKey);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.firstKey  = in.readUTF();
        this.secondKey = in.readShort();
    }

    public String getFirstKey() {
        return firstKey;
    }

    public void setFirstKey(String firstKey) {
        this.firstKey = firstKey;
    }

    public int getSecondKey() {
        return secondKey;
    }

    public void setSecondKey(int secondKey) {
        this.secondKey = secondKey;
    }

    @Override
    public String toString() {
        return firstKey+"--"+secondKey;
    }
}

设计MapReduce程序

按照任务划分：只有map任务、map、reduce任务都有

map and reduce

数据ETL的过程：

ETL，是英文Extract-Transform-Load的缩写

用来描述将数据从来源端经过萃取（extract）、转置（transform）、加载（load）至目的端的过程。

map阶段:分片处理，将一个大任务拆分

-》数据过滤

-》数据补全

比如：根据IP得到省市区信息

-》字段格式化

对某个字段进行格式化：如时间

域名处理：URL地址得到域名

reduce阶段：合并处理

map

Sqoop（sql +hadoop）本质就是一个MapReduce程序，但是只有Map任务

网站日志分析

pv:pageview,浏览量

页面的浏览次数，衡量一个网站用户访问的页面数量

打开一个页面算1，打开多个累加

UV：Unique visitor独立访客数

1天有多少人访问了网站。

1天内一个人多次访问网站，算1

VV：Visit View,访客的访问次数

记录所有访客1天内访问网站的次数。

当访客完成浏览并关闭浏览器的时候，下次在打开，重新累加

打开网站 -》浏览A-》浏览B -》关闭浏览

IP:独立ip

值一天内使用不同IP地址访问网站的数量

统计指标的作用

让数据变现

具体的体现：

按钮的位置：

即当发布新活动时哪一个按钮最容易被用户所点击。

页面来源：

即在浏览器中向用户推广电商网站信息，通过上一个页面的来源，推断出用户是从哪一处推广页面进入

可以通过分析结果，分析出用户行为，爱好。