Mahout SparseVectorsFromSequenceFiles Explained (5)

Latest recommended article published 2013-03-08 15:14:38

Licensed under CC 4.0 BY-SA

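This article describes how the vocabulary dictionary chunks are generated: each term and its integer id are written in SequenceFile format so that the vocabulary can be managed and looked up efficiently. When a chunk reaches the preset size limit, a new chunk is created automatically.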

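This part covers createDictionaryChunks.

Parameters

wordCountPath: the input directory, i.e. the wordcount directory produced in the previous step

dictionaryPathBase: the output directory

The remaining parameters are self-explanatory.

The code is simple: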

    List<Path> chunkPaths = Lists.newArrayList();

    Configuration conf = new Configuration(baseConf);

    FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

    // Convert the chunk size limit from megabytes to bytes.
    long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
    int chunkIndex = 0;
    // The first chunk file: DICTIONARY_FILE followed by the chunk index.
    Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
    chunkPaths.add(chunkPath);

    // Each record is (term as Text, term id as IntWritable).
    SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

    try {
      long currentChunkSize = 0;
      Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
      int i = 0;
      // Iterate over every record in the word count output files.
      for (Pair<Writable,Writable> record
           : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
        // The current chunk is over the limit: close it and start the next one.
        if (currentChunkSize > chunkSizeLimit) {
          Closeables.closeQuietly(dictWriter);
          chunkIndex++;

          chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
          chunkPaths.add(chunkPath);

          dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
          currentChunkSize = 0;
        }

        Writable key = record.getFirst();
        // Estimated entry size: overhead + 2 bytes per character + 4 bytes for the int id.
        int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
        currentChunkSize += fieldSize;
        // The id is a running counter shared across all chunks.
        dictWriter.append(key, new IntWritable(i++));
      }
      // Total number of terms, i.e. the dimensionality of the vectors.
      maxTermDimension[0] = i;
    } finally {
      Closeables.closeQuietly(dictWriter);
    }

    return chunkPaths;

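In short, this writes the vocabulary out as sequence files. Because a sequence file stores key-value pairs, the value is an auto-incrementing integer that indicates which dimension of the vector the term occupies.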
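To make the term-to-dimension idea concrete, here is a minimal sketch in plain Java with no Hadoop or Mahout dependencies. The class TermDimensionSketch and its methods buildDictionary and toSparseVector are hypothetical names invented for this illustration, not part of Mahout. It assumes every term appears exactly once in the input, as it would in word count output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Mahout.
class TermDimensionSketch {

  /** Assigns ids 0, 1, 2, ... in iteration order, mirroring new IntWritable(i++). */
  static Map<String, Integer> buildDictionary(List<String> uniqueTerms) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    int i = 0;
    for (String term : uniqueTerms) {
      // Assumption: terms are unique, so no id is ever overwritten.
      dictionary.put(term, i++);
    }
    return dictionary;
  }

  /** Builds a sparse term-frequency vector: dimension index -> count. */
  static Map<Integer, Integer> toSparseVector(List<String> tokens,
                                              Map<String, Integer> dictionary) {
    Map<Integer, Integer> vector = new TreeMap<>();
    for (String token : tokens) {
      Integer dimension = dictionary.get(token);
      if (dimension != null) {
        // Terms missing from the dictionary are simply skipped.
        vector.merge(dimension, 1, Integer::sum);
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    Map<String, Integer> dictionary =
        buildDictionary(Arrays.asList("apple", "banana", "cherry"));
    Map<Integer, Integer> vector =
        toSparseVector(Arrays.asList("banana", "apple", "banana", "kiwi"), dictionary);
    System.out.println(dictionary); // {apple=0, banana=1, cherry=2}
    System.out.println(vector);     // {0=1, 1=2}
  }
}
```

The dictionary size (3 here) plays the role of maxTermDimension[0]: it is the number of dimensions a vector needs.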

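When the estimated size of the current chunk exceeds chunkSizeLimit, the writer is closed and a new chunk is started. Because the check runs before each append, a chunk can end up slightly larger than the limit.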
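The rollover rule can be tried out without a Hadoop cluster. The following is a rough sketch in plain Java that keeps the same check-before-append logic and the same size estimate but collects terms into in-memory lists instead of sequence files. The class DictionaryChunkSketch and the method assignChunks are hypothetical names for this illustration, and the value 4 for DICTIONARY_BYTE_OVERHEAD is an assumption rather than something stated in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
class DictionaryChunkSketch {

  // Assumed value for illustration only; check the Mahout source for the real constant.
  static final int DICTIONARY_BYTE_OVERHEAD = 4;

  /** Estimated bytes per entry: overhead + 2 bytes per char + 4 bytes for the int id. */
  static int fieldSize(String term) {
    return DICTIONARY_BYTE_OVERHEAD + term.length() * 2 + Integer.SIZE / 8;
  }

  /** Splits terms into chunks using the same check-before-append rule. */
  static List<List<String>> assignChunks(List<String> terms, long chunkSizeLimit) {
    List<List<String>> chunks = new ArrayList<>();
    List<String> current = new ArrayList<>();
    chunks.add(current);
    long currentChunkSize = 0;
    for (String term : terms) {
      // Roll over only after the limit has already been exceeded.
      if (currentChunkSize > chunkSizeLimit) {
        current = new ArrayList<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      currentChunkSize += fieldSize(term);
      current.add(term);
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("apple", "banana", "cherry", "date");
    // Estimated sizes: apple=18, banana=20, cherry=20, date=16 bytes.
    System.out.println(assignChunks(terms, 30)); // [[apple, banana], [cherry, date]]
  }
}
```

With a limit of 30 bytes, the first chunk holds an estimated 38 bytes, which shows the overshoot: the limit is a soft threshold, not a hard cap.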