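This part covers createDictionaryChunks.
Parameters:
wordCountPath: the input directory, i.e. the wordcount directory described above
dictionaryPathBase: the output directory
The remaining parameters are self-explanatory.
The code is straightforward: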
List<Path> chunkPaths = Lists.newArrayList();
Configuration conf = new Configuration(baseConf);
FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
// Convert the chunk size limit from megabytes to bytes
long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
int chunkIndex = 0;
Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
chunkPaths.add(chunkPath);
SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
try {
  long currentChunkSize = 0;
  Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
  int i = 0;
  for (Pair<Writable,Writable> record
       : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
    // The current chunk has passed the limit: close it and start the next one
    if (currentChunkSize > chunkSizeLimit) {
      Closeables.closeQuietly(dictWriter);
      chunkIndex++;
      chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
      chunkPaths.add(chunkPath);
      dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
      currentChunkSize = 0;
    }
    Writable key = record.getFirst();
    // Rough per-entry size: fixed overhead + 2 bytes per character + 4 bytes for the int
    int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
    currentChunkSize += fieldSize;
    // The term is the key; its id is a running counter shared across all chunks
    dictWriter.append(key, new IntWritable(i++));
  }
  // Total number of terms, i.e. the dimensionality of the vectors
  maxTermDimension[0] = i;
} finally {
  Closeables.closeQuietly(dictWriter);
}
return chunkPaths;
In short, this writes the vocabulary out as sequence files. Because a sequence file stores key-value pairs, the value is an auto-incrementing integer that indicates which dimension of the vector the term belongs to.
When a chunk is full, a new chunk is started. The size check runs before each append, so a chunk can end up slightly over the limit, and the integer ids keep counting across chunks rather than restarting at 0.
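The rollover logic is easier to see without the Hadoop classes. Below is a minimal sketch in plain Java that mimics the same loop, with in-memory maps standing in for the sequence files. The class name DictionaryChunkSketch, the method names, and the value of ASSUMED_OVERHEAD are all made up for illustration; the real value of DICTIONARY_BYTE_OVERHEAD is not shown in the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DictionaryChunkSketch {
    // Hypothetical stand-in for DICTIONARY_BYTE_OVERHEAD (the real value is an assumption here).
    static final int ASSUMED_OVERHEAD = 4;

    // Same estimate as in the loop above: overhead + 2 bytes per char + 4 bytes for the int id.
    static int estimateSize(String term) {
        return ASSUMED_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
    }

    // Splits distinct terms into chunks; each map plays the role of one dictionary file.
    static List<Map<String, Integer>> chunk(List<String> terms, long chunkSizeLimit) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        chunks.add(current);
        long currentChunkSize = 0;
        int i = 0; // running id, shared across all chunks
        for (String term : terms) {
            // The check happens before the append, so a chunk may overshoot the limit.
            if (currentChunkSize > chunkSizeLimit) {
                current = new LinkedHashMap<>();
                chunks.add(current);
                currentChunkSize = 0;
            }
            currentChunkSize += estimateSize(term);
            current.put(term, i++);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "banana", "cherry", "date", "elderberry");
        List<Map<String, Integer>> chunks = chunk(terms, 30);
        for (int c = 0; c < chunks.size(); c++) {
            System.out.println("chunk " + c + ": " + chunks.get(c));
        }
    }
}
```

With a 30-byte limit and the assumed overhead of 4, "apple" and "banana" add up to 38 estimated bytes, so "cherry" opens the second chunk. The ids continue as 2, 3, 4 in the later chunks.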
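To summarize: this post walked through how the dictionary chunks are generated. Each term is stored with its integer id in SequenceFile format so that the vocabulary can be managed and looked up, and when a chunk reaches the configured size limit a new chunk is created automatically.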
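To show what "the id is the vector dimension" means in practice, here is a small sketch of a lookup against an already loaded dictionary. It is not Mahout code: the class DictionaryLookupSketch and the method toSparseTf are hypothetical, and the dictionary is an ordinary in-memory map rather than a sequence file.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DictionaryLookupSketch {
    // Builds a sparse term-frequency vector: dimension index -> count.
    static Map<Integer, Integer> toSparseTf(List<String> tokens, Map<String, Integer> dictionary) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            Integer dimension = dictionary.get(token);
            if (dimension == null) {
                continue; // the token is not in the dictionary, so it has no dimension
            }
            vector.merge(dimension, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = Map.of("apple", 0, "banana", 1, "cherry", 2);
        List<String> tokens = List.of("banana", "apple", "banana", "durian");
        System.out.println(toSparseTf(tokens, dictionary)); // prints {0=1, 1=2}
    }
}
```

Tokens missing from the dictionary are simply skipped in this sketch.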