Serialization and Deserialization in MapReduce

This article takes a close look at data serialization, collection, sorting, and spilling in MapReduce. It explains the roles of MapOutputBuffer and BlockingBuffer and the serialization mechanism, shows how the serialized data is spilled to disk and written (optionally compressed) through the IFile.Writer class, and how LineRecordWriter produces the final output file format. It also covers how the reduce side fetches and deserializes data, and how context is used to serialize and write the output.


The collect phase in map (serialization):

During the map phase, record collection is implemented by MapOutputBuffer.

Field: BlockingBuffer bb

BlockingBuffer extends DataOutputStream; its underlying out stream points to the inner Buffer class.

Initialization:

      serializationFactory = new SerializationFactory(job);
      keySerializer = serializationFactory.getSerializer(keyClass);
      keySerializer.open(bb);
      valSerializer = serializationFactory.getSerializer(valClass);
      valSerializer.open(bb);
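
As an aside, the same Serializer API can be exercised on its own. Below is a minimal, self-contained sketch, assuming the stock WritableSerialization that Hadoop registers by default; it shows what open() and serialize() do when bound to an ordinary in-memory stream instead of the BlockingBuffer:

    import java.io.ByteArrayOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.serializer.SerializationFactory;
    import org.apache.hadoop.io.serializer.Serializer;

    public class SerializerSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The factory picks a Serialization implementation for the class;
        // for Writable types this resolves to WritableSerialization.
        SerializationFactory factory = new SerializationFactory(conf);
        Serializer<Text> serializer = factory.getSerializer(Text.class);

        // open() binds the serializer to an output stream, exactly as
        // keySerializer.open(bb) binds it to the BlockingBuffer above.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.open(out);
        serializer.serialize(new Text("hello"));
        serializer.close();
        System.out.println("serialized bytes: " + out.size());
      }
    }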

Serialization:

        int keystart = bufindex;
        keySerializer.serialize(key);
        if (bufindex < keystart) {
          // wrapped the key; reset required
          bb.reset();
          keystart = 0;
        }
        // serialize value bytes into buffer
        final int valstart = bufindex;
        valSerializer.serialize(value);
        int valend = bb.markRecord();

The serialized content passes through BlockingBuffer as raw bytes into the underlying Buffer; the bytes themselves land in byte[] kvbuffer, while per-record accounting data goes into the kvoffsets/kvindices arrays, as sketched below. The actual byte copy is performed by the write method shown next.
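
To make the spill code quoted later easier to follow, here is a rough sketch of that accounting layout as it existed in the pre-YARN MapOutputBuffer; the constant names are the ones the spill loop references, but treat the exact layout as version-specific:

    // Accounting layout around kvbuffer (a sketch, not authoritative):
    //
    //   kvbuffer  : byte[] -- raw serialized key/value bytes (circular)
    //   kvoffsets : int[]  -- one entry per record, pointing into kvindices;
    //                         this is the array the sorter permutes
    //   kvindices : int[]  -- three ints of metadata per record:
    //
    //       kvindices[kvoff + PARTITION]  partition number of the record
    //       kvindices[kvoff + KEYSTART]   offset of the key in kvbuffer
    //       kvindices[kvoff + VALSTART]   offset of the value in kvbuffer
    //
    // The key length is VALSTART - KEYSTART; the value's end is recovered
    // from the following record (or the record mark), which is what
    // getVBytesForOffset computes in the spill code below.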

      public synchronized void write(byte b[], int off, int len)
          throws IOException {
        boolean buffull = false;
        boolean wrap = false;
        spillLock.lock();
        try {
          do {
            if (sortSpillException != null) {
              throw (IOException)new IOException("Spill failed"
                  ).initCause(sortSpillException);
            }
            // sufficient buffer space?
            if (bufstart <= bufend && bufend <= bufindex) {
              buffull = bufindex + len > bufvoid;
              wrap = (bufvoid - bufindex) + bufstart > len;
            } else {
              // bufindex <= bufstart <= bufend
              // bufend <= bufindex <= bufstart
              wrap = false;
              buffull = bufindex + len > bufstart;
            }
            if (kvstart == kvend) {
              // spill thread not running
              if (kvend != kvindex) {
                // we have records we can spill
                final boolean bufsoftlimit = (bufindex > bufend)
                  ? bufindex - bufend > softBufferLimit
                  : bufend - bufindex < bufvoid - softBufferLimit;
                if (bufsoftlimit || (buffull && !wrap)) {
                  startSpill();
                }
              } else if (buffull && !wrap) {
                // We have no buffered records, and this record is too large
                // to write into kvbuffer. We must spill it directly from
                // collect
                final int size = ((bufend <= bufindex)
                  ? bufindex - bufend
                  : (bufvoid - bufend) + bufindex) + len;
                bufstart = bufend = bufindex = bufmark = 0;
                kvstart = kvend = kvindex = 0;
                bufvoid = kvbuffer.length;
                throw new MapBufferTooSmallException(size + " bytes");
              }
            }
            if (buffull && !wrap) {
              try {
                while (kvstart != kvend) {
                  reporter.progress();
                  spillDone.await();
                }
              } catch (InterruptedException e) {
                throw (IOException)new IOException(
                    "Buffer interrupted while waiting for the writer"
                    ).initCause(e);
              }
            }
          } while (buffull && !wrap);
        } finally {
          spillLock.unlock();
        }
        // here, we know that we have sufficient space to write
        if (buffull) {
          final int gaplen = bufvoid - bufindex;
          System.arraycopy(b, off, kvbuffer, bufindex, gaplen);
          len -= gaplen;
          off += gaplen;
          bufindex = 0;
        }
        System.arraycopy(b, off, kvbuffer, bufindex, len);
        bufindex += len;
      }
    }
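
The two System.arraycopy calls at the end implement the circular wrap: when the record does not fit in the tail of kvbuffer but there is room at the head, the copy is split in two. Here is a stripped-down, runnable toy model of just that wrap logic, with hypothetical names and none of the spill coordination:

    public class CircularWriteSketch {
      // Toy model of the arraycopy pair in write(): split the copy when it
      // would run past the end of the ring buffer.
      static int circularWrite(byte[] ring, int writePos, byte[] src) {
        int tail = ring.length - writePos;              // bytes left before the end
        if (src.length <= tail) {
          System.arraycopy(src, 0, ring, writePos, src.length);
        } else {
          System.arraycopy(src, 0, ring, writePos, tail);           // fill the tail
          System.arraycopy(src, tail, ring, 0, src.length - tail);  // wrap to the head
        }
        return (writePos + src.length) % ring.length;   // new write position
      }

      public static void main(String[] args) {
        byte[] ring = new byte[8];
        int pos = circularWrite(ring, 6, "abcd".getBytes());
        System.out.println(pos);  // prints 2: "ab" filled the tail, "cd" wrapped
      }
    }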

The sort and spill phase in map

The serialized in-memory records are sorted and written straight out to a spill file:

      long size = (bufend >= bufstart
          ? bufend - bufstart
          : (bufvoid - bufend) + bufstart) +
              partitions * APPROX_HEADER_LENGTH;
      FSDataOutputStream out = null;
      try {
        // create spill file
        final SpillRecord spillRec = new SpillRecord(partitions);
        final Path filename = mapOutputFile.getSpillFileForWrite(getTaskID(),
            numSpills, size);
        out = rfs.create(filename);
        final int endPosition = (kvend > kvstart)
          ? kvend : kvoffsets.length + kvend;
        sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
        int spindex = kvstart;
        IndexRecord rec = new IndexRecord();
        InMemValBytes value = new InMemValBytes();
        for (int i = 0; i < partitions; ++i) {
          IFile.Writer<K, V> writer = null;
          try {
            long segmentStart = out.getPos();
            writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
                                      spilledRecordsCounter);
            if (combinerRunner == null) {
              // spill directly
              DataInputBuffer key = new DataInputBuffer();
              while (spindex < endPosition &&
                  kvindices[kvoffsets[spindex % kvoffsets.length]
                            + PARTITION] == i) {
                final int kvoff = kvoffsets[spindex % kvoffsets.length];
                getVBytesForOffset(kvoff, value);
                key.reset(kvbuffer, kvindices[kvoff + KEYSTART],
                          (kvindices[kvoff + VALSTART] -
                           kvindices[kvoff + KEYSTART]));
                writer.append(key, value);
                ++spindex;
              }
            } else {
              int spstart = spindex;
              while (spindex < endPosition &&
                  kvindices[kvoffsets[spindex % kvoffsets.length]
                            + PARTITION] == i) {
                ++spindex;
              }
              // Note: we would like to avoid the combiner if we've fewer
              // than some threshold of records for a partition
              if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }
            }
            // close the writer
            writer.close();
            // record offsets
            rec.startOffset = segmentStart;
            rec.rawLength = writer.getRawLength();
            rec.partLength = writer.getCompressedLength();
            spillRec.putIndex(rec, i);
            writer = null;
          } finally {
            if (null != writer) writer.close();
          }
        }
        if (totalIndexCacheMemory >= INDEX_CACHE_MEMORY_LIMIT) {
          // create spill index file
          Path indexFilename = mapOutputFile.getSpillIndexFileForWrite(
              getTaskID(), numSpills,
              partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH);
          spillRec.writeToFile(indexFilename, job);
        } else {
          indexCacheList.add(spillRec);
          totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
        }
        ++numSpills;
      } finally {
        if (out != null) out.close();
      }

As the code shows, the spill calls the append method of the inner class IFile.Writer:

    public void append(DataInputBuffer key, DataInputBuffer value)
        throws IOException {
      int keyLength = key.getLength() - key.getPosition();
      int valueLength = value.getLength() - value.getPosition();
      WritableUtils.writeVInt(out, keyLength);
      WritableUtils.writeVInt(out, valueLength);
      out.write(key.getData(), key.getPosition(), keyLength);
      out.write(value.getData(), value.getPosition(), valueLength);
      // Update bytes written
      decompressedBytesWritten += keyLength + valueLength +
                      WritableUtils.getVIntSize(keyLength) +
                      WritableUtils.getVIntSize(valueLength);
      ++numRecordsWritten;
    }
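
Each record therefore goes to disk as <VInt key length><VInt value length><key bytes><value bytes>; the real file additionally carries compression, checksums, and an end-of-file marker record. As a minimal sketch, assuming an uncompressed stream, such a record could be read back as follows (the real counterpart is IFile.Reader):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.io.WritableUtils;

    public class IFileRecordSketch {
      // Read back one <VInt keyLen><VInt valLen><key><value> record,
      // mirroring what append() wrote above.
      static byte[][] readRecord(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int keyLength = WritableUtils.readVInt(in);
        int valueLength = WritableUtils.readVInt(in);
        byte[] key = new byte[keyLength];
        byte[] value = new byte[valueLength];
        in.readFully(key);
        in.readFully(value);
        return new byte[][] { key, value };
      }
    }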

Fetching Iterable<VALUEIN> values in reduce (deserialization)

Field: DataInputBuffer buffer;

Initialization:

    this.keyDeserializer =
        serializationFactory.getDeserializer(keyClass);
    this.keyDeserializer.open(buffer);
    this.valueDeserializer =
        serializationFactory.getDeserializer(valueClass);
    this.valueDeserializer.open(buffer);

These are used when iterating over values in the reduce(KEYIN key, Iterable<VALUEIN> values, Context context) function; the next method of the values iterator looks like this:

    public VALUEIN next() {
      // if this is the first record, we don't need to advance
      if (firstValue) {
        firstValue = false;
        return value;
      }
      // if this isn't the first record and the next key is different, they
      // can't advance it here.
      if (!nextKeyIsSame) {
        throw new NoSuchElementException("iterate past last value");
      }
      // otherwise, go to the next key/value pair
      try {
        nextKeyValue();
        return value;
      } catch (IOException ie) {
        throw new RuntimeException("next value iterator failed", ie);
      } catch (InterruptedException ie) {
        // this is bad, but we can't modify the exception list of java.util
        throw new RuntimeException("next value iterator interrupted", ie);
      }
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
      if (!hasMore) {
        key = null;
        value = null;
        return false;
      }
      firstValue = !nextKeyIsSame;
      DataInputBuffer next = input.getKey();
      currentRawKey.set(next.getData(), next.getPosition(),
                        next.getLength() - next.getPosition());
      buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
      key = keyDeserializer.deserialize(key);
      next = input.getValue();
      buffer.reset(next.getData(), next.getPosition(), next.getLength());
      value = valueDeserializer.deserialize(value);
      hasMore = input.next();
      if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                            currentRawKey.getLength(),
                            next.getData(), next.getPosition(),
                            next.getLength() - next.getPosition()) == 0;
      } else {
        nextKeyIsSame = false;
      }
      inputValueCounter.increment(1);
      return true;
    }
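
One detail worth stressing: deserialize(key) and deserialize(value) are handed the previous objects, and for Writable types the deserializer refills and returns that same instance. The values iterator therefore yields one reused object per iteration, so a reducer that keeps values across iterations must copy them. A minimal sketch of the safe pattern, assuming Text values (CopyValuesReducer is a hypothetical example class):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CopyValuesReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<Text> kept = new ArrayList<Text>();
        for (Text v : values) {
          // Copy before storing: kept.add(v) would leave N references to
          // the single, repeatedly-overwritten object behind the iterator.
          kept.add(new Text(v));
        }
        context.write(key, new Text("saw " + kept.size() + " values"));
      }
    }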

 

context.write in reduce (serialization), taking output through TextOutputFormat as the example

context.write((KEYOUT) key, (VALUEOUT) value) in turn calls output.write(key, value); here output points to a LineRecordWriter<K, V>.

The write method of LineRecordWriter<K, V> works as follows:

    public synchronized void write(K key, V value)
        throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
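
So each record becomes one text line: key bytes, then the separator, then value bytes, then a newline; a (Text, IntWritable) pair ("word", 42) comes out as word<TAB>42. The separator defaults to a tab and is configurable. Here is a minimal sketch of switching it to a comma; the property name changed across Hadoop releases, so both spellings below are assumptions to check against your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SeparatorSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Older releases read one property name, Hadoop 2+ the other;
        // setting both is harmless.
        conf.set("mapred.textoutputformat.separator", ",");
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "csv-output");
        // ... set input/output formats and paths as usual ...
      }
    }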

The out stream in this code points to an FSDataOutputStream, which TextOutputFormat.getRecordWriter creates:

    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(new DataOutputStream
                                        (codec.createOutputStream(fileOut)),
                                        keyValueSeparator);
    }
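
Whether the compressed branch is taken is driven by the job configuration. A minimal sketch of enabling it through the standard FileOutputFormat helpers (GzipCodec is just one possible codec):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputSketch {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");
        // With these two settings, getRecordWriter above takes the else
        // branch and wraps fileOut in codec.createOutputStream(fileOut).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
      }
    }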
