Processing words with TensorFlow's built-in vocabulary processor

Latest recommended article published on 2025-06-03 06:25:20

Original

Licensed under CC 4.0 BY-SA

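In text processing we usually have to write our own code to build a vocabulary together with the word-to-idx and idx-to-word mappings. That means counting every word in the corpus and building the corresponding dictionaries.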
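As a point of comparison, the hand-written approach described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `build_vocab`, the `min_count` parameter, the `<UNK>` token at index 0, and whitespace tokenization are all assumptions made for this example, not part of TensorFlow.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Build word->idx and idx->word mappings (hypothetical helper)."""
    # Count every word across the whole corpus (whitespace tokenization).
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Index 0 is reserved for unknown words; this is a choice made for the sketch.
    idx2word = ["<UNK>"]
    # More frequent words get smaller indices; ties are broken alphabetically.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            idx2word.append(word)
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return word2idx, idx2word


corpus = ["i like tensorflow", "i like text processing"]
word2idx, idx2word = build_vocab(corpus)

# Words that never appeared in the corpus fall back to index 0.
ids = [word2idx.get(w, 0) for w in "i like nlp".split()]
print(ids)  # [1, 2, 0]
```

Here `idx2word` is a plain list, so the idx-to-word lookup is simply `idx2word[i]`.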

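When mapping documents to idx sequences, we can instead use TensorFlow's built-in preprocessing.VocabularyProcessor (found under tf.contrib.learn.preprocessing in TensorFlow 1.x) to build the word-to-idx mapping.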
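To make the idea concrete without depending on a particular TensorFlow version, here is a rough plain-Python sketch of what such a processor does: learn a vocabulary, then turn each document into a fixed-length list of word ids. The class `SimpleVocabProcessor` is a hypothetical stand-in written for this example and is not the TensorFlow implementation. It assumes whitespace tokenization, ids assigned in order of first appearance, and id 0 used both for padding and for unknown words.

```python
class SimpleVocabProcessor:
    """Hypothetical stand-in for illustration; not the TensorFlow class."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # Assumption of this sketch: id 0 doubles as padding and unknown-word id.
        self.word2idx = {"<UNK>": 0}

    def fit(self, documents):
        # Assign the next free id to each word the first time it is seen.
        for doc in documents:
            for word in doc.split():
                self.word2idx.setdefault(word, len(self.word2idx))
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            # Trim: keep only the first max_document_length tokens.
            tokens = doc.split()[: self.max_document_length]
            ids = [self.word2idx.get(tok, 0) for tok in tokens]
            # Pad: fill the tail with zeros up to the fixed length.
            ids += [0] * (self.max_document_length - len(ids))
            rows.append(ids)
        return rows


docs = ["i like tensorflow very much", "hello world"]
processor = SimpleVocabProcessor(max_document_length=4).fit(docs)
encoded = processor.transform(docs)
print(encoded)  # [[1, 2, 3, 4], [6, 7, 0, 0]]
```

The first document has five words, so its last word is dropped. The second has two words, so it is padded with two zeros. Every row ends up with exactly `max_document_length` entries.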

VocabularyProcessor: Maps documents to sequences of word ids

class VocabularyProcessor(object):
  """Maps documents to sequences of word ids."""

  def __init__(self,
               max_document_length,
               min_frequency=0,
               vocabulary=None,
               tokenizer_fn=None):
    """Initializes a VocabularyProcessor instance.

    Args:
      max_document_length: Maximum length of documents.
        if documents are longer, they will be trimmed, if shorter - padded.
      min_frequency: Minimum frequency of words in the vocabulary.
      vocabulary: CategoricalVocabulary object.

    Attributes:
      vocabulary_: CategoricalVocabulary object.
    """

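max_document_length: the maximum length of a document. If a sentence exceeds this length, it is truncated and the words beyond the limit are dropped. For example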