[Big Data Notes] -- More on the WordCount Bug

In an earlier post [http://flyfoxs.iteye.com/blog/2110463] I discussed how Hadoop, when splitting a file, may cut a single line of data into two meaningless pieces, and how, without special handling, this would corrupt the data and break processing. After being set straight by others, it turns out this "bug" does not exist.

 

After Hadoop splits a file, the read path applies a few rules that guarantee a line is never delivered as two separate records. Below is an analysis of the class that implements this mechanism (LineRecordReader):

 

1) Splits are computed by the JobClient, not on the Hadoop cluster. (This is only a rough partition by byte offset; the precise record boundaries are decided later on the Mapper side, using the rules below.)

2) Although the JobClient produces the splits, the Mapper does not read strictly within those byte ranges (a toy simulation after the configuration snippet below illustrates this):

Except for the first split, every split discards the partial line it starts in and begins with the first complete line after its start offset.

A split does not stop exactly at its end offset; it finishes the line it has started, reading up to the newline that lies in the next split (quite aggressive: except for the last one, almost every split reads across its boundary).

3) For very long lines there is a theoretical bug: if a line exceeds the configured maximum length, the excess bytes are silently dropped. It remains theoretical, because the default limit is Integer.MAX_VALUE:

 

    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",  Integer.MAX_VALUE);
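
To make rules 1) and 2) concrete, here is a small self-contained toy simulation (plain Java, not Hadoop code; the sample data, split boundaries, and class name SplitRuleDemo are made up for illustration). It cuts a four-line buffer at two arbitrary byte offsets, applies the two rules to each split, and shows that every line is emitted exactly once, by exactly one split.

public class SplitRuleDemo {
    public static void main(String[] args) throws Exception {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes("UTF-8");
        // {start, length}; both cut points (9 and 18) fall in the middle of a line
        int[][] splits = { {0, 9}, {9, 9}, {18, data.length - 18} };

        for (int[] s : splits) {
            int start = s[0], end = s[0] + s[1];
            int pos = start;
            // Rule 1: every split except the first discards the partial line it
            // starts in and begins after the first newline it finds.
            if (start != 0) {
                while (pos < data.length && data[pos - 1] != '\n') {
                    pos++;
                }
            }
            // Rule 2: a split emits every line that *starts* before its end offset,
            // reading past the boundary to finish the last one if necessary.
            StringBuilder lines = new StringBuilder();
            while (pos < end && pos < data.length) {
                int lineStart = pos;
                while (pos < data.length && data[pos] != '\n') {
                    pos++;
                }
                if (pos < data.length) {
                    pos++; // consume the '\n', even if it lies in the next split
                }
                lines.append(new String(data, lineStart, pos - lineStart, "UTF-8"));
            }
            System.out.printf("split [%d, %d) -> %s%n",
                    start, end, lines.toString().replace("\n", "\\n"));
        }
    }
}

Both cut points fall in the middle of a line, yet the output is alpha/bravo for the first split, charlie for the second and delta for the third: no line is lost or duplicated.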

 

 

The code below shows that LineRecordReader does not stop its last line exactly at the end of the split; it keeps reading until it reaches the newline that lies inside the next split.

The code is fairly involved, so I have added comments; feel free to ask if anything is unclear.

 

  public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    /* We're reading data from in, but the head of the stream may be
     * already buffered in buffer, so we have several cases:
     * 1. No newline characters are in the buffer, so we need to copy
     *    everything and read another buffer from the stream.
     * 2. An unambiguously terminated line is in buffer, so we just
     *    copy to str.
     * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
     *    in CR.  In this case we copy everything up to CR to str, but
     *    we also need to see what follows CR: if it's LF, then we
     *    need consume LF as well, so next call to readLine will read
     *    from after that.
     * We use a flag prevCharCR to signal if previous character was CR
     * and, if it happens to be at the end of the buffer, delay
     * consuming it until we have a chance to look at the char that
     * follows.
     */
    str.clear();
    int txtLength = 0; //tracks str.getLength(), as an optimization
    int newlineLength = 0; //length of terminating newline
    boolean prevCharCR = false; //true if prev char was CR
    long bytesConsumed = 0;
    do {
      // bufferPosn tracks how far into the buffer we have read, so the next iteration resumes from there
      int startPosn = bufferPosn; //starting from where we left off the last time
      
      // if the buffer has been fully consumed, reset it and refill it from the input stream
      if (bufferPosn >= bufferLength) {
        startPosn = bufferPosn = 0;
        if (prevCharCR)
          ++bytesConsumed; //account for CR from previous read

        // read the next chunk from the stream; we only refill once the current buffer
        // is fully consumed (bufferPosn >= bufferLength); bufferLength records how many bytes were read
        bufferLength = in.read(buffer);
        if (bufferLength <= 0)
          break; // EOF
      }
      // scan the buffer for a line terminator, handling the Mac, Windows and Linux conventions
      for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
        // is the current character '\n'?
        if (buffer[bufferPosn] == LF) {
          // a line terminated by '\r\n' gives newlineLength = 2; a bare '\n' gives newlineLength = 1
          newlineLength = (prevCharCR) ? 2 : 1;
          ++bufferPosn; // at next invocation proceed from following byte
          break;
        }
        // a line terminated by a bare '\r' gives newlineLength = 1
        if (prevCharCR) { //CR + notLF, we are at notLF
          newlineLength = 1;
          break;
        }
        // remember whether the current character is '\r'; the next iteration decides whether it terminates the line
        prevCharCR = (buffer[bufferPosn] == CR);
      }
      int readLength = bufferPosn - startPosn;
      
      // the last byte of the buffer was '\r'; hold it back until we see what follows
      if (prevCharCR && newlineLength == 0)
        --readLength; //CR at the end of the buffer
      
      
      bytesConsumed += readLength;
      
      // appendLength: the payload read from the buffer in this iteration, excluding the terminator
      int appendLength = readLength - newlineLength;
      
      // txtLength: the length of str accumulated so far
      if (appendLength > maxLineLength - txtLength) {
        // if appending would exceed the per-line limit, the excess bytes are not added to str
        appendLength = maxLineLength - txtLength;
      }
      
      // append the selected range of the buffer to the result (str)
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
    // if no terminator was found in the buffer and we have not yet consumed the allowed maximum, keep reading from the stream
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

    if (bytesConsumed > (long)Integer.MAX_VALUE)
      throw new IOException("Too many bytes before newline: " + bytesConsumed);    
    return (int)bytesConsumed;
  }
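
As a quick sanity check of the terminator handling above, the sketch below feeds the reader a small in-memory stream that mixes LF, CRLF and a bare CR. It assumes a Hadoop 1.x/2.x-style client jar on the classpath, where this reader is exposed as org.apache.hadoop.util.LineReader; the class name LineReaderDemo and the sample bytes are just for illustration.

import java.io.ByteArrayInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class LineReaderDemo {
    public static void main(String[] args) throws Exception {
        // One line per terminator style: LF (Linux), CRLF (Windows), bare CR (old Mac),
        // plus a final line with no terminator at all.
        byte[] data = "unix\nwindows\r\nmac\rlast".getBytes("UTF-8");
        LineReader reader = new LineReader(new ByteArrayInputStream(data));
        Text line = new Text();
        int consumed;
        // readLine() returns the number of bytes consumed including the terminator,
        // and 0 at end of stream; the terminator itself is never copied into 'line'.
        while ((consumed = reader.readLine(line, Integer.MAX_VALUE, Integer.MAX_VALUE)) > 0) {
            System.out.println(consumed + " bytes -> \"" + line + "\"");
        }
        reader.close();
    }
}

Each line comes back with its terminator stripped, and the returned byte counts (5, 9, 4, 4) include the terminators, which matches how bytesConsumed is accumulated in readLine() above.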

 

The code below shows how LineRecordReader decides whether the first (partial) line of a split should be skipped.

 

public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                    Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());
    boolean skipFirstLine = false;
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        // every split except the very first one skips its first (possibly partial) line;
        // backing up one byte ensures a line that begins exactly at 'start' is not lost
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
  }
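
Finally, since the truncation described in point 3) only matters when the default cap is lowered, here is a minimal sketch of how that property could be set on the job configuration (the 1 MB value and the class name MaxLineLengthConfig are hypothetical; the property name is the one read by initialize() above).

import org.apache.hadoop.conf.Configuration;

public class MaxLineLengthConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is Integer.MAX_VALUE; lines longer than the cap are truncated
        // (the excess is consumed but not appended), which is the theoretical bug in point 3).
        conf.setInt("mapred.linerecordreader.maxlength", 1024 * 1024); // cap lines at 1 MB
        System.out.println("max line length = "
                + conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE));
    }
}

With the default left at Integer.MAX_VALUE, readLine() never clamps appendLength, so the theoretical data loss does not occur in practice.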

 

References:

http://blog.youkuaiyun.com/bluishglc/article/details/9380087

http://blog.youkuaiyun.com/wanghai__/article/details/6583364
