Source Code Analysis of the Data Write Path
The data write process
Constructing a ParquetRecordWriter also constructs an InternalParquetRecordWriter:
public ParquetRecordWriter(
    ...) {
  internalWriter = new InternalParquetRecordWriter<T>(w, writeSupport, schema,
      extraMetaData, blockSize, pageSize, compressor, dictionaryPageSize,
      enableDictionary, validating, writerVersion);
}
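In practice the ParquetRecordWriter is rarely constructed by hand; it is created by ParquetOutputFormat when a Hadoop job asks for a record writer. The following is a minimal configuration sketch, not part of the analyzed source: the schema string and class name are made up for illustration, and the imports use the org.apache.parquet prefix of newer releases (older releases used the parquet.* prefix). It shows where the blockSize / pageSize / enableDictionary constructor arguments above come from.

import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetJobSetup {                      // hypothetical class name
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, used only for illustration.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int32 id; required binary name (UTF8); }");

    Job job = Job.getInstance();
    // GroupWriteSupport is the example WriteSupport shipped with parquet-hadoop.
    GroupWriteSupport.setSchema(schema, job.getConfiguration());
    job.setOutputFormatClass(ParquetOutputFormat.class);
    ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class);
    // These settings become the blockSize / pageSize / enableDictionary
    // constructor arguments discussed above.
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);
    ParquetOutputFormat.setEnableDictionary(job, true);
  }
}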
Constructing the InternalParquetRecordWriter triggers initStore:
public InternalParquetRecordWriter(
    ...) {
  ...
  initStore();
}
When ParquetRecordWriter.write is called, it delegates to InternalParquetRecordWriter.write:
public void write(Void key, T value) throws IOException, InterruptedException {
  internalWriter.write(value);
}
InternalParquetRecordWriter.write is implemented as follows:
public void write(T value) throws IOException, InterruptedException {
  writeSupport.write(value);
  ++ recordCount;
  checkBlockSizeReached();
}
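The call to checkBlockSizeReached after every record is what decides when the buffered row group is flushed to the file. The sketch below is a rough approximation of that logic rather than the verbatim source (the real method also adapts how often it re-checks memory, and flushRowGroupToStore is a placeholder name for the actual flush step); it only keeps the core idea: once the buffered bytes exceed the configured row-group size, flush and re-initialize the stores.

// Rough approximation of checkBlockSizeReached, not the verbatim source.
// flushRowGroupToStore is a placeholder for the actual flush method.
private void checkBlockSizeReached() throws IOException {
  long buffered = columnStore.memSize();   // bytes currently buffered for this row group
  if (buffered > blockSize) {              // blockSize is the configured row-group size
    flushRowGroupToStore();                // write the buffered row group to the file
    initStore();                           // reset the stores for the next row group
  }
}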
The initStore phase
int initialBlockBufferSize = max(MINIMUM_BUFFER_SIZE, rowGroupSize / schema.getColumns().size() / 5);
int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
These two expressions compute the initial buffer size for the row group (block) and for each page.
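To make the formulas concrete, assume (purely for illustration) a MINIMUM_BUFFER_SIZE of 64 KB, a 128 MB row group, a 1 MB page size, and a schema with 10 columns:

// Illustrative numbers only; MINIMUM_BUFFER_SIZE is assumed to be 64 KB here.
int rowGroupSize = 128 * 1024 * 1024;   // 128 MB
int pageSize     = 1024 * 1024;         // 1 MB
int columns      = 10;

// 128 MB / 10 columns / 5 ≈ 2.6 MB, well above the 64 KB minimum.
int initialBlockBufferSize = Math.max(64 * 1024, rowGroupSize / columns / 5);

// min(pageSize + 10%, initialBlockBufferSize) ≈ 1.1 MB.
int initialPageBufferSize = Math.max(64 * 1024,
    Math.min(pageSize + pageSize / 10, initialBlockBufferSize));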
pageStore = new ColumnChunkPageWriteStore(compressor, schema, initialBlockBufferSize);
columnStore = new ColumnWriteStoreImpl(pageStore, pageSize, initialPageBufferSize, dictionaryPageSize, enableDictionary, writerVersion);
This initializes pageStore (a ColumnChunkPageWriteStore) and columnStore (a ColumnWriteStoreImpl), which contain the PageWriter and ColumnWriter inner classes, respectively, that perform the actual data writes.
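Roughly speaking, columnStore hands out one ColumnWriter per column of the schema, and each ColumnWriter encodes values (together with their repetition and definition levels) into the page buffers managed by pageStore. The fragment below only illustrates that relationship; in the real write path values are routed here by MessageColumnIO rather than written against the ColumnWriter directly.

// Illustration only: values are normally routed here by MessageColumnIO,
// not written against ColumnWriter directly.
ColumnDescriptor idColumn = schema.getColumns().get(0);
ColumnWriter writer = columnStore.getColumnWriter(idColumn);
// value, repetition level, definition level
writer.write(42, 0, 0);
// When a page or the row group fills up, the encoded bytes end up in the
// PageWriter that pageStore created for this column.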
The write-preparation phase
MessageColumnIO columnIO = new ColumnIOFactory(validating).getColumnIO(schema);
writeSupport.prepareForWrite(columnIO.getRecordWriter(columnStore));
This builds a MessageColumnIO from the schema and passes the record writer it creates to writeSupport.prepareForWrite, preparing for the data writes.
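The object returned by columnIO.getRecordWriter(columnStore) is a RecordConsumer, and prepareForWrite is where a WriteSupport implementation keeps hold of it. A minimal custom WriteSupport might look roughly like the sketch below; the class, schema, and field names are made up for illustration, and the imports again use the newer org.apache.parquet package prefix.

import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical WriteSupport for a simple (id, name) record, for illustration only.
public class SimpleWriteSupport extends WriteSupport<String[]> {
  private final MessageType schema = MessageTypeParser.parseMessageType(
      "message example { required int32 id; required binary name (UTF8); }");
  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration) {
    return new WriteContext(schema, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    // This is the record writer produced by MessageColumnIO above.
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(String[] record) {
    recordConsumer.startMessage();
    recordConsumer.startField("id", 0);
    recordConsumer.addInteger(Integer.parseInt(record[0]));
    recordConsumer.endField("id", 0);
    recordConsumer.startField("name", 1);
    recordConsumer.addBinary(Binary.fromString(record[1]));
    recordConsumer.endField("name", 1);
    recordConsumer.endMessage();
  }
}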
The validating flag is set when the ParquetRecordWriter is initialized; it defaults to false, as the following code shows: