Serialization and Deserialization in MapReduce

Article link: https://blog.youkuaiyun.com/i86tech2018/article/details/25929587

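This article walks through how MapReduce serializes, collects, sorts, and spills data. It explains the roles of MapOutputBuffer and BlockingBuffer in serialization, shows how the serialized data is written straight to a file through the IFile.Writer class (which also handles compression), and follows the output path through LineRecordWriter into a concrete file format. It also covers how the reduce phase fetches and deserializes its input, and how context serializes and writes the output data.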


The collect step in map (serialization):

In the map phase, collect relies on MapOutputBuffer to gather the output records.

Field: BlockingBuffer bb;

BlockingBuffer extends DataOutputStream, and its out field points to a Buffer.

Initialization:

    serializationFactory = new SerializationFactory(job);
    keySerializer = serializationFactory.getSerializer(keyClass);
    keySerializer.open(bb);
    valSerializer = serializationFactory.getSerializer(valClass);
    valSerializer.open(bb);

Serialization:

    int keystart = bufindex;
    keySerializer.serialize(key);
    if (bufindex < keystart) {
      // wrapped the key; reset required
      bb.reset();
      keystart = 0;
    }
    // serialize value bytes into buffer
    final int valstart = bufindex;
    valSerializer.serialize(value);
    int valend = bb.markRecord();
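The offset bookkeeping above can be illustrated without Hadoop. The following is a minimal sketch, assuming a plain `DataOutputStream` over a growable byte array in place of BlockingBuffer and kvbuffer; the class `CollectSketch` and its `collect` method are hypothetical names, and `writeUTF`/`writeInt` merely stand in for the key and value serializers.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the collect step: key bytes and value bytes of
// every record are appended to one shared buffer, and the offsets are noted.
final class CollectSketch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);

    /** Serializes one record and returns {keystart, valstart, valend}. */
    int[] collect(String key, int value) {
        try {
            int keystart = out.size();  // bytes written before the key
            out.writeUTF(key);          // key: 2-byte length + UTF-8 data
            int valstart = out.size();  // the value starts where the key ends
            out.writeInt(value);        // value: 4-byte big-endian int
            int valend = out.size();    // end of the record
            return new int[] {keystart, valstart, valend};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a copy of everything serialized so far. */
    byte[] bytes() {
        return buffer.toByteArray();
    }
}
```

Unlike kvbuffer, this buffer grows instead of wrapping around, so the `bufindex < keystart` reset case never arises in the sketch.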

The content passes through BlockingBuffer as bytes and is written into Buffer, which stores it in byte[] kvbuffer:

    public synchronized void write(byte b[], int off, int len)
        throws IOException {
      boolean buffull = false;
      boolean wrap = false;
      spillLock.lock();
      try {
        do {
          if (sortSpillException != null) {
            throw (IOException) new IOException("Spill failed"
                ).initCause(sortSpillException);
          }

          // sufficient buffer space?
          if (bufstart <= bufend && bufend <= bufindex) {
            buffull = bufindex + len > bufvoid;
            wrap = (bufvoid - bufindex) + bufstart > len;
          } else {
            // bufindex <= bufstart <= bufend
            // bufend <= bufindex <= bufstart
            wrap = false;
            buffull = bufindex + len > bufstart;
          }

          if (kvstart == kvend) {
            // spill thread not running
            if (kvend != kvindex) {
              // we have records we can spill
              final boolean bufsoftlimit = (bufindex > bufend)
                  ? bufindex - bufend > softBufferLimit
                  : bufend - bufindex < bufvoid - softBufferLimit;
              if (bufsoftlimit || (buffull && !wrap)) {
                startSpill();
              }
            } else if (buffull && !wrap) {
              // We have no buffered records, and this record is too large
              // to write into kvbuffer. We must spill it directly from
              // collect
              final int size = ((bufend <= bufindex)
                  ? bufindex - bufend
                  : (bufvoid - bufend) + bufindex) + len;
              bufstart = bufend = bufindex = bufmark = 0;
              kvstart = kvend = kvindex = 0;
              bufvoid = kvbuffer.length;
              throw new MapBufferTooSmallException(size + " bytes");
            }
          }

          if (buffull && !wrap) {
            try {
              while (kvstart != kvend) {
                reporter.progress();
                spillDone.await();
              }
            } catch (InterruptedException e) {
              throw (IOException) new IOException(
                  "Buffer interrupted while waiting for the writer"
                  ).initCause(e);
            }
          }
        } while (buffull && !wrap);
      } finally {
        spillLock.unlock();
      }
      // here, we know that we have sufficient space to write
      if (buffull) {
        final int gaplen = bufvoid - bufindex;
        System.arraycopy(b, off, kvbuffer, bufindex, gaplen);
        len -= gaplen;
        off += gaplen;
        bufindex = 0;
      }
      System.arraycopy(b, off, kvbuffer, bufindex, len);
      bufindex += len;
    }
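The last few lines of write() are a two-step copy into a circular buffer: fill the gap up to the end of the array, then continue from index 0. Here is a minimal sketch of only that part, assuming the caller has already made sure enough space is free (the spill coordination is left out). `RingBufferSketch` is a hypothetical class; the field names mirror the excerpt for readability.

```java
// Hypothetical ring buffer showing the wrap-around copy at the end of write().
final class RingBufferSketch {
    final byte[] kvbuffer;
    int bufindex = 0;  // next write position

    RingBufferSketch(int capacity) {
        kvbuffer = new byte[capacity];
    }

    /** Copies len bytes from b[off..] into the ring, wrapping at the end. */
    void write(byte[] b, int off, int len) {
        if (len > kvbuffer.length) {
            throw new IllegalArgumentException("record larger than buffer");
        }
        int gaplen = kvbuffer.length - bufindex;  // room left before the end
        if (len > gaplen) {
            // First part fills the tail of the array...
            System.arraycopy(b, off, kvbuffer, bufindex, gaplen);
            len -= gaplen;
            off += gaplen;
            bufindex = 0;  // ...and the rest continues from the front.
        }
        System.arraycopy(b, off, kvbuffer, bufindex, len);
        bufindex += len;
    }
}
```

With a capacity of 8, writing six bytes and then four more leaves the last two of those four at indexes 0 and 1, which is the situation the `bufindex < keystart` check in collect detects.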

The sort and spill step in map

The results that were serialized into memory are written directly to a file:

    long size = (bufend >= bufstart
        ? bufend - bufstart
        : (bufvoid - bufend) + bufstart) +
        partitions * APPROX_HEADER_LENGTH;
    FSDataOutputStream out = null;
    try {
      // create spill file
      final SpillRecord spillRec = new SpillRecord(partitions);
      final Path filename = mapOutputFile.getSpillFileForWrite(getTaskID(),
          numSpills, size);
      out = rfs.create(filename);

      final int endPosition = (kvend > kvstart)
          ? kvend : kvoffsets.length + kvend;
      sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
      int spindex = kvstart;
      IndexRecord rec = new IndexRecord();
      InMemValBytes value = new InMemValBytes();
      for (int i = 0; i < partitions; ++i) {
        IFile.Writer<K, V> writer = null;
        try {
          long segmentStart = out.getPos();
          writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
              spilledRecordsCounter);
          if (combinerRunner == null) {
            // spill directly
            DataInputBuffer key = new DataInputBuffer();
            while (spindex < endPosition &&
                kvindices[kvoffsets[spindex % kvoffsets.length]
                    + PARTITION] == i) {
              final int kvoff = kvoffsets[spindex % kvoffsets.length];
              getVBytesForOffset(kvoff, value);
              key.reset(kvbuffer, kvindices[kvoff + KEYSTART],
                  (kvindices[kvoff + VALSTART] -
                      kvindices[kvoff + KEYSTART]));
              writer.append(key, value);
              ++spindex;
            }
          } else {
            int spstart = spindex;
            while (spindex < endPosition &&
                kvindices[kvoffsets[spindex % kvoffsets.length]
                    + PARTITION] == i) {
              ++spindex;
            }
            // Note: we would like to avoid the combiner if we've fewer
            // than some threshold of records for a partition
            if (spstart != spindex) {
              combineCollector.setWriter(writer);
              RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
              combinerRunner.combine(kvIter, combineCollector);
            }
          }

          // close the writer
          writer.close();

          // record offsets
          rec.startOffset = segmentStart;
          rec.rawLength = writer.getRawLength();
          rec.partLength = writer.getCompressedLength();
          spillRec.putIndex(rec, i);

          writer = null;
        } finally {
          if (null != writer) writer.close();
        }
      }

      if (totalIndexCacheMemory >= INDEX_CACHE_MEMORY_LIMIT) {
        // create spill index file
        Path indexFilename = mapOutputFile.getSpillIndexFileForWrite(
            getTaskID(), numSpills,
            partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH);
        spillRec.writeToFile(indexFilename, job);
      } else {
        indexCacheList.add(spillRec);
        totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
      }
      ++numSpills;
    } finally {
      if (out != null) out.close();
    }
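The overall shape of the spill (sort the record index by partition and key, write each partition as one contiguous segment, remember where each segment starts and how long it is) can be sketched in plain Java. This is a simplified illustration, not the Hadoop code: `SpillSketch` and `spill` are hypothetical names, keys are written as text lines rather than through IFile.Writer, and there is no combiner or compression.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of sort-and-spill: records are only reordered through
// an index array, then written out one partition segment at a time.
final class SpillSketch {
    /** Writes sorted keys to out; returns {startOffset, length} per partition. */
    static long[][] spill(int[] partitionOf, String[] keys, int partitions,
                          ByteArrayOutputStream out) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort by partition first, then by key within a partition.
        Arrays.sort(order, (a, b) -> partitionOf[a] != partitionOf[b]
                ? Integer.compare(partitionOf[a], partitionOf[b])
                : keys[a].compareTo(keys[b]));

        long[][] index = new long[partitions][2];
        int spindex = 0;
        for (int p = 0; p < partitions; p++) {
            long segmentStart = out.size();
            while (spindex < order.length && partitionOf[order[spindex]] == p) {
                byte[] line = (keys[order[spindex]] + "\n")
                        .getBytes(StandardCharsets.UTF_8);
                out.write(line, 0, line.length);
                ++spindex;
            }
            index[p][0] = segmentStart;               // like rec.startOffset
            index[p][1] = out.size() - segmentStart;  // like rec.rawLength
        }
        return index;
    }
}
```

The per-partition pairs play the role of the IndexRecord entries: a reader that wants partition p only needs to seek to the start offset and read that many bytes.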

As the code shows, the spill calls the append method of the nested class IFile.Writer:

    public void append(DataInputBuffer key, DataInputBuffer value)
        throws IOException {
      int keyLength = key.getLength() - key.getPosition();
      int valueLength = value.getLength() - value.getPosition();

      WritableUtils.writeVInt(out, keyLength);
      WritableUtils.writeVInt(out, valueLength);
      out.write(key.getData(), key.getPosition(), keyLength);
      out.write(value.getData(), value.getPosition(), valueLength);

      // Update bytes written
      decompressedBytesWritten += keyLength + valueLength +
          WritableUtils.getVIntSize(keyLength) +
          WritableUtils.getVIntSize(valueLength);
      ++numRecordsWritten;
    }
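The record layout that append produces is: key length, value length, key bytes, value bytes, with both lengths stored as variable-length integers. The sketch below shows the same framing under one loud assumption: it uses a generic unsigned LEB128 varint, which is not the encoding used by WritableUtils.writeVInt. `RecordFramingSketch` and its members are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical length-prefixed record writer. The varint here is unsigned
// LEB128, chosen for brevity; Hadoop's writeVInt uses a different encoding.
final class RecordFramingSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private long rawBytesWritten = 0;
    private long numRecordsWritten = 0;

    /** Writes a non-negative int as a varint and returns the bytes used. */
    private int writeVarInt(int v) {
        int n = 1;
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);  // low 7 bits, continuation bit set
            v >>>= 7;
            n++;
        }
        out.write(v);                      // final byte, continuation bit clear
        return n;
    }

    /** Appends one record: both lengths first, then key bytes, then value bytes. */
    void append(byte[] key, byte[] value) {
        int header = writeVarInt(key.length) + writeVarInt(value.length);
        out.write(key, 0, key.length);
        out.write(value, 0, value.length);
        rawBytesWritten += header + key.length + value.length;
        ++numRecordsWritten;
    }

    long rawLength() { return rawBytesWritten; }
    long records() { return numRecordsWritten; }
    byte[] bytes() { return out.toByteArray(); }
}
```

Short keys and values cost a single length byte each, which is why the byte counter has to add the actual varint sizes instead of a fixed header size.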

Obtaining Iterable<VALUEIN> values in reduce (deserialization)

Field: DataInputBuffer buffer;

Initialization:

    this.keyDeserializer =
        serializationFactory.getDeserializer(keyClass);
    this.keyDeserializer.open(buffer);
    this.valueDeserializer =
        serializationFactory.getDeserializer(valueClass);
    this.valueDeserializer.open(buffer);

The deserialization happens while the reduce phase runs reduce(KEYIN key, Iterable<VALUEIN> values, Context context). The next method of the values iterator, and the nextKeyValue method it calls, are as follows:

    public VALUEIN next() {
      // if this is the first record, we don't need to advance
      if (firstValue) {
        firstValue = false;
        return value;
      }
      // if this isn't the first record and the next key is different, they
      // can't advance it here.
      if (!nextKeyIsSame) {
        throw new NoSuchElementException("iterate past last value");
      }
      // otherwise, go to the next key/value pair
      try {
        nextKeyValue();
        return value;
      } catch (IOException ie) {
        throw new RuntimeException("next value iterator failed", ie);
      } catch (InterruptedException ie) {
        // this is bad, but we can't modify the exception list of java.util
        throw new RuntimeException("next value iterator interrupted", ie);
      }
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
      if (!hasMore) {
        key = null;
        value = null;
        return false;
      }
      firstValue = !nextKeyIsSame;
      DataInputBuffer next = input.getKey();
      currentRawKey.set(next.getData(), next.getPosition(),
          next.getLength() - next.getPosition());
      buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
      key = keyDeserializer.deserialize(key);
      next = input.getValue();
      buffer.reset(next.getData(), next.getPosition(), next.getLength());
      value = valueDeserializer.deserialize(value);
      hasMore = input.next();
      if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
            currentRawKey.getLength(), next.getData(), next.getPosition(),
            next.getLength() - next.getPosition()) == 0;
      } else {
        nextKeyIsSame = false;
      }
      inputValueCounter.increment(1);
      return true;
    }
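The grouping logic boils down to one look-ahead: after reading a record, compare its raw key bytes with the next record's raw key bytes, and keep handing out values while they match. The following sketch shows that idea over in-memory arrays. It assumes the input is already sorted by key and uses `Arrays.equals` in place of the job's raw comparator; `GroupingSketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how sorted records are grouped into one
// values iteration per distinct key.
final class GroupingSketch {
    private final byte[][] rawKeys;  // sorted, "serialized" keys
    private final int[] values;
    private int pos = 0;
    private boolean hasMore;
    private boolean nextKeyIsSame = false;
    private String key;
    private int value;

    GroupingSketch(String[] sortedKeys, int[] values) {
        rawKeys = new byte[sortedKeys.length][];
        for (int i = 0; i < sortedKeys.length; i++) {
            rawKeys[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
        }
        this.values = values;
        this.hasMore = sortedKeys.length > 0;
    }

    /** Advances to the next record and peeks at the key that follows it. */
    boolean nextKeyValue() {
        if (!hasMore) {
            key = null;
            return false;
        }
        byte[] currentRawKey = rawKeys[pos];
        key = new String(currentRawKey, StandardCharsets.UTF_8);  // "deserialize"
        value = values[pos];
        pos++;
        hasMore = pos < rawKeys.length;
        // Compare raw bytes, so the next key never has to be deserialized.
        nextKeyIsSame = hasMore && Arrays.equals(currentRawKey, rawKeys[pos]);
        return true;
    }

    String currentKey() { return key; }

    /** Collects every value of the current key, like iterating values in reduce(). */
    List<Integer> currentValues() {
        List<Integer> group = new ArrayList<>();
        group.add(value);
        while (nextKeyIsSame) {
            nextKeyValue();
            group.add(value);
        }
        return group;
    }
}
```

Feeding it the keys a, a, b with values 1, 2, 3 yields two groups: a with [1, 2] and b with [3].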

context.write in reduce, using TextOutputFormat as the example (serialization)

context.write((KEYOUT) key, (VALUEOUT) value); calls output.write(key, value);, and at this point output refers to a LineRecordWriter<K, V>.

The write method of LineRecordWriter<K, V> works as follows:

    public synchronized void write(K key, V value)
        throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
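The null handling in write decides what each output line looks like. Below is a minimal sketch of the same rules, assuming an in-memory stream instead of a file and treating only Java null as "absent" (there is no NullWritable here). `LineWriterSketch` is a hypothetical class, and every object is written through `toString()` as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical line-oriented writer following the same null rules as write().
final class LineWriterSketch {
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final byte[] keyValueSeparator;

    LineWriterSketch(String separator) {
        keyValueSeparator = separator.getBytes(StandardCharsets.UTF_8);
    }

    void write(Object key, Object value) {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return;                     // nothing to write, not even a newline
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!(nullKey || nullValue)) {  // separator only when both are present
            out.write(keyValueSeparator, 0, keyValueSeparator.length);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(NEWLINE, 0, NEWLINE.length);
    }

    private void writeObject(Object o) {
        byte[] bytes = o.toString().getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    String contents() {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A record with only a key or only a value produces a line without the separator, so downstream parsers cannot rely on every line containing one.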

In this code, out refers to the FSDataOutputStream created below (wrapped in the codec's output stream when compression is enabled):

    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(new DataOutputStream
          (codec.createOutputStream(fileOut)),
          keyValueSeparator);
    }
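The two branches differ only in whether the file stream is wrapped by a compressing stream before the record writer sees it. A rough sketch of that decorator pattern with JDK classes follows, assuming GZIP in place of the configured Hadoop codec and a byte array in place of the file. `CompressedOutputSketch`, `writeText`, and `readText` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: the writer code is identical in both cases; only the
// stream it is handed changes. GZIP stands in for the Hadoop codec.
final class CompressedOutputSketch {
    static byte[] writeText(String text, boolean isCompressed) {
        ByteArrayOutputStream fileOut = new ByteArrayOutputStream();
        try (OutputStream out = isCompressed
                ? new GZIPOutputStream(fileOut) : fileOut) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fileOut.toByteArray();  // closing flushed the GZIP trailer
    }

    static String readText(byte[] data, boolean isCompressed) {
        try (InputStream in = isCompressed
                ? new GZIPInputStream(new ByteArrayInputStream(data))
                : new ByteArrayInputStream(data)) {
            ByteArrayOutputStream copy = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                copy.write(chunk, 0, n);
            }
            return new String(copy.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the wrapping happens before the writer is constructed, the line-writing logic never needs to know whether its output is compressed.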