Tokenizer Explained: A Handy Tool for Text Preprocessing in Deep Learning

Original article · First published 2023-09-21 16:12:46 · Last modified 2023-09-21 16:58:13

License: CC 4.0 BY-SA

Tags: #keras #artificial-intelligence #deep-learning #word-vectors #Tokenizer

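This article introduces Tokenizer, a class that converts text into sequences and vectorizes it. It covers the class's definition, methods, and attributes, then walks through vectorizing English and Chinese text: English text is indexed or vectorized after the default preprocessing, while Chinese text must first be segmented with jieba. The vectorized data can then be used to train deep neural network models and to run inference.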


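1 Introduction to Tokenizer

Tokenizer is a class that vectorizes text by converting it into sequences. A computer cannot understand the meaning of words directly, so each token (a word, or for Chinese a single character or a word) is usually mapped to a positive integer. A text thus becomes a sequence of integers, which is then vectorized and fed to a model.

Tokenizer can vectorize a text corpus in two ways: by turning each text into a sequence of integers (each integer being the index of a token in a dictionary), or by turning it into a vector in which the coefficient for each token can be a binary value, a word count, a TF-IDF weight, and so on.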
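To make the two representations concrete, here is a minimal pure-Python sketch. It does not use Keras: the vocabulary `vocab` and the helpers `to_sequence` and `to_binary_vector` are hypothetical names introduced only for illustration, and the splitting is a plain whitespace split.

```python
# Illustrative only: a hand-built vocabulary, not one produced by Keras.
vocab = {"deep": 1, "learning": 2, "is": 3, "fun": 4}  # index 0 is left unused

def to_sequence(text, vocab):
    """Map each known word to its integer index, skipping unknown words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def to_binary_vector(text, vocab):
    """Return a 0/1 vector with one slot per index (slot 0 stays 0)."""
    vec = [0] * (len(vocab) + 1)
    for idx in to_sequence(text, vocab):
        vec[idx] = 1
    return vec

text = "Deep learning is fun fun"
seq = to_sequence(text, vocab)       # representation 1: integer sequence
vec = to_binary_vector(text, vocab)  # representation 2: fixed-length vector
print(seq)
print(vec)
```

The sequence keeps word order and repetition, while the vector has a fixed length and only records which words are present.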

1.1 Tokenizer definition

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

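Parameters:

num_words: the maximum number of words to keep, based on word frequency. Only the most frequent num_words - 1 words are kept, because index 0 is reserved.

filters: a string in which each character is one that will be filtered out of the text. The default is all punctuation plus tab and newline, minus the ' character.

lower: boolean. Whether to convert the text to lowercase.

split: string. The separator used to split the text.

char_level: if True, every character is treated as a token.

oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.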
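The effect of filters, lower, and split can be sketched in plain Python. The helper `split_text` below is a hypothetical stand-in written for this illustration, not the Keras implementation; it assumes that each filtered character is replaced by the separator before splitting.

```python
# Same characters as the default filters value in the definition above.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_text(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace every filtered character with `split`, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left behind by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]

tokens = split_text("Hello, World! It's a test.")
print(tokens)
```

Because the apostrophe is not in the filter string, "it's" survives as a single token, while the comma, exclamation mark, and period are removed.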

1.2 Tokenizer methods

(1) fit_on_texts(texts)

Parameter texts: the list of texts to fit on.
Return value: none.

(2) texts_to_sequences(texts)

Parameter texts: the list of texts to convert to sequences.
Return value: a list of sequences, one per input text.

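(3) texts_to_sequences_generator(texts)

This is the generator version of texts_to_sequences.

Parameter texts: the list of texts to convert to sequences.
Return value: a generator that yields, on each iteration, the sequence for one input text.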

(4) texts_to_matrix(texts, mode)

Parameter texts: the list of texts to vectorize.
Parameter mode: one of 'binary', 'count', 'tfidf', 'freq'; the default is 'binary'.
Return value: a numpy array of shape (len(texts), num_words).

(5) fit_on_sequences(sequences)

Parameter sequences: the list of sequences to fit on.
Return value: none.

(6) sequences_to_matrix(sequences, mode)

Parameter sequences: the list of sequences to vectorize.
Parameter mode: one of 'binary', 'count', 'tfidf', 'freq'; the default is 'binary'.
Return value: a numpy array of shape (len(sequences), num_words).
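The relationship between the sequence methods and the matrix methods can be sketched without Keras. In the hedged example below, `word_index` is a hand-built vocabulary and `to_sequences_generator` and `to_matrix` are hypothetical helpers that return plain lists rather than numpy arrays. Only the 'binary', 'count', and 'freq' modes are sketched; 'tfidf' is left out because its exact weighting is implementation-specific.

```python
from collections import Counter

# Hypothetical hand-built vocabulary; index 0 is reserved and never assigned.
word_index = {"the": 1, "cat": 2, "sat": 3, "saw": 4, "dog": 5}

def to_sequences_generator(texts, word_index):
    """Lazily yield one integer sequence per text, dropping unknown words."""
    for text in texts:
        yield [word_index[w] for w in text.lower().split() if w in word_index]

def to_matrix(sequences, num_words, mode="binary"):
    """Turn integer sequences into rows of length num_words."""
    rows = []
    for seq in sequences:
        row = [0.0] * num_words
        # Indices >= num_words fall outside the matrix and are ignored.
        counts = Counter(i for i in seq if i < num_words)
        for idx, c in counts.items():
            if mode == "binary":
                row[idx] = 1.0
            elif mode == "count":
                row[idx] = float(c)
            elif mode == "freq":
                row[idx] = c / len(seq)
            else:
                raise ValueError(f"unsupported mode in this sketch: {mode}")
        rows.append(row)
    return rows

texts = ["The cat sat", "the cat saw the dog"]
sequences = list(to_sequences_generator(texts, word_index))
binary_matrix = to_matrix(sequences, num_words=6, mode="binary")
count_matrix = to_matrix(sequences, num_words=6, mode="count")
freq_matrix = to_matrix(sequences, num_words=6, mode="freq")

print(sequences)
print(count_matrix)
```

Each row has num_words columns regardless of the length of the text, which is why the matrix form can be fed directly to a fixed-size model input, while the sequence form usually needs padding first.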

1.3 Tokenizer attributes

(1) word_counts

Type: dictionary.

Description: maps each word (string) to the number of times it appeared during fitting. Set only after fit_on_texts has been called.

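(2) word_docs

Type: dictionary.

Description: maps each word (string) to the number of documents or texts it appeared in during fitting. Set only after fit_on_texts has been called.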

(3) word_index

Type: dictionary.

Description: maps each word (string) to its rank, which is also its index. Set only after fit_on_texts has been called.

(4) document_count

Type: integer.

Description: the number of documents (texts or sequences) the tokenizer was fitted on. Set only after fit_on_texts or fit_on_sequences has been called.
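The following pure-Python sketch shows how statistics with the same meaning as these attributes could be derived from a tiny corpus. It is an illustration under simple assumptions (whitespace splitting, ties in frequency broken by first appearance), not the Keras implementation; the variable names mirror the attribute names only for readability.

```python
from collections import Counter

texts = ["love is blind", "love is love"]

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of texts that contain each word
for text in texts:
    words = text.lower().split()
    word_counts.update(words)
    word_docs.update(set(words))  # count each word at most once per text

document_count = len(texts)

# Rank words by descending frequency; indices start at 1 because 0 is reserved.
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i for i, w in enumerate(ranked, start=1)}
index_word = {i: w for w, i in word_index.items()}

print(dict(word_counts))
print(dict(word_docs))
print(word_index)
```

Note the difference between the first two dictionaries: "love" occurs three times in total but appears in only two texts.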

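2 Text vectorization with Tokenizer

2.1 Vectorizing English text

By default, all punctuation is removed and each text becomes a space-separated sequence of words (words may contain the ' character). These sequences are then split into lists of tokens, which are indexed or vectorized. 0 is a reserved index that is never assigned to any word.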

from keras.preprocessing.text import Tokenizer

texts = ["Life is a journey, and if you fall in love with the journey, you will be in love forever.",
         "Dreams are like stars, you may never touch them, but if you follow them, they will lead you to your destiny.",
         "Memories are the heart's treasures, they hold the wisdom and beauty of our past.",
         "Nature is the most beautiful artist, its paintings are endless and always breathtaking.",
         "True happiness is not about having everything, but about being content with what you have.",
         "Wisdom comes with age, but more often with experience.",
         "Music has the power to transport us to a different place, a different time.",
         "Love is blind, but often sees more than others.",
         "Time heals all wounds, but only if you let it.",
         "Home is where the heart is, and for many, that is where the memories are."]

tokenizer = Tokenizer(num_words=64, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True, split=' ', char_level=False, oov_token=None,
                      document_count=0)

# Update the internal vocabulary from the list of input texts
tokenizer.fit_on_texts(texts)

print("Number of documents processed, document_count: ", tokenizer.document_count)
print("Word-to-index mapping, word_index: \n", tokenizer.word_index)
print("Index-to-word mapping, index_word: \n", tokenizer.index_word)
print("Total occurrences of each word, word_counts: \n", tokenizer.word_counts)
print("Number of documents containing each word, word_docs: \n", tokenizer.word_docs)
print("Number of documents containing each word, keyed by word index, index_docs: \n", tokenizer.index_docs)

The output is as follows:

Number of documents processed, document_count:  10
Word-to-index mapping, word_index: 
 {'is': 1, 'you': 2, 'the': 3, 'but': 4, 'and': 5, 'with': 6, 'are': 7, 'a': 8, 'if': 9, 'love': 10, 'to': 11, 'journey': 12, 'in': 13, 'will': 14, 'them': 15, 'they': 16, 'mem
