Global Counters
A counter records the running state and progress of a job, much like a report on the job's execution. It tracks statistics gathered while the job runs, including the job's input and output data volumes, the number of map input records, the number of reduce groups, and so on.
Its scope is global: if a job runs 3 map tasks, the counter reports the total accumulated across all 3 map tasks.
Built-in Counters
Hadoop ships with many built-in counters. Let's first look at the report produced by running a MapReduce program.
For example:
2018-07-23 20:55:43,336 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1080)) - Final Counters for attempt_local445845887_0001_m_000000_0: Counters: 17 // total number of counters
File System Counters
    FILE: Number of bytes read=468
    FILE: Number of bytes written=293975
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=16
    Map output records=16
    Map output bytes=352
    Map output materialized bytes=390
    Input split bytes=101
    Combine input records=0
    Spilled Records=16
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=0
    Total committed heap usage (bytes)=268435456
File Input Format Counters
    Bytes Read=311
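These built-in counters can also be read programmatically rather than scraped from the log. A minimal sketch, assuming a Job object for a job that has already completed (TaskCounter is the framework enum behind the Map-Reduce Framework group shown above; the class and method names here are made up for illustration):
import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class BuiltinCounterReader {
    // Prints one built-in counter of a finished job; `job` must already
    // have completed, e.g. via job.waitForCompletion(true).
    static void printMapInputRecords(Job job) throws IOException {
        Counters counters = job.getCounters();
        // TaskCounter.MAP_INPUT_RECORDS backs the "Map input records"
        // line in the Map-Reduce Framework group above.
        long records = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        System.out.println("Map input records = " + records);
    }
}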
Custom Counters:
1. Global counting is needed, for example, to count the number of records with a missing field, or the number of records carrying a particular tag.
For example: a 500 MB file is split into 4 blocks, so 4 map tasks run against it. To count all records containing the field "北京", you need a global total across all 4 map tasks.
Custom counter example
Count the number of records in flow.txt whose phone number starts with '139'.
public enum MyCounter {
    // A counter for records starting with 139; incremented each time
    // a phone number beginning with 139 is encountered.
    COUNT_139_START
}
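Besides the enum form, the context also accepts a group name and a counter name as plain strings, which avoids declaring an enum. A minimal sketch of the call as it would appear inside map() (the names "MyCounters" and "COUNT_139_START" are arbitrary labels chosen for this example):
// Equivalent string-based form; the group name "MyCounters" and the
// counter name "COUNT_139_START" are arbitrary labels for illustration.
context.getCounter("MyCounters", "COUNT_139_START").increment(1L);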
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Counter_139_Lines {
    // The counting happens on the map side, where each record is read once.
    static class MyMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Take one record and pull out the phone number field.
            String line = value.toString();
            String[] infos = line.split("\t");
            String phone = infos[1];
            // If the phone number starts with 139, increment the counter.
            if (phone.startsWith("139")) {
                // Fetch the global counter through the context. No initial value
                // is given, so it starts from 0; setValue(long value) could set one.
                Counter counter = context.getCounter(MyCounter.COUNT_139_START);
                // increment(long incr) adds the given step; here the step is 1.
                counter.increment(1L);
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Counter_139_Lines.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        // Number of reduce tasks; 0 makes this a map-only job (default is 1).
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path("hdfs://hadoop01:9000/flowin"));
        // An output path is still required even though the job writes no records;
        // it is where the success marker file (_SUCCESS) is stored.
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hadoop01:9000/flowcountout02"));
        job.waitForCompletion(true);
    }
}
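After waitForCompletion returns, the driver can also read the counter's final aggregated value directly instead of looking it up in the log. A minimal sketch, replacing the last line of main above:
// Read the aggregated counter value in the driver once the job finishes.
job.waitForCompletion(true);
long count139 = job.getCounters().findCounter(MyCounter.COUNT_139_START).getValue();
System.out.println("Records starting with 139 = " + count139);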
Run result:
File System Counters
    FILE: Number of bytes read=167
    FILE: Number of bytes written=293093
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=30880080
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=7
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
Map-Reduce Framework
    Map input records=304920
    Map output records=0
    Input split bytes=101
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=41
    Total committed heap usage (bytes)=215482368
// the custom counter
com.ghgj.cn.counter.MyCounter
    COUNT_139_START=55440
File Input Format Counters
    Bytes Read=30880080
File Output Format Counters
    Bytes Written=0