Word-Level One-Hot Encoding with Keras Built-in Functions

Article link: https://blog.youkuaiyun.com/C_chuxin/article/details/85301959

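This article introduces how to use the Tokenizer class in the Keras library to build word-level one-hot encodings. A concrete code example shows how to create a Tokenizer instance, generate the word index, and convert texts into lists of integer indices and into one-hot encodings.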


[Date] 2018.12.27

[Title] Word-level one-hot encoding with Keras built-in functions

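Overview

This post explains some of the methods of the Tokenizer class in keras.preprocessing.text, which can be used to build word-level one-hot encodings. For other ways of implementing one-hot encoding, see: [Deep-Learning-with-Python] Deep Learning for Text and Sequences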

1. Some methods of the Tokenizer class

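tokenizer = Tokenizer(num_words=1000): the constructor. num_words=1000 restricts the tokenizer to the 1000 most common words (strictly speaking, only words whose index is below num_words are kept, i.e. the num_words - 1 most frequent ones).
tokenizer.fit_on_texts(samples): builds the word index, i.e. the mapping from each word to an integer, from the given texts.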

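word_index = tokenizer.word_index: retrieves the computed word index. This is a dictionary whose indices start at 1, e.g. {'the': 1, 'cat': 2, ...}. Words are lowercased by default, and index 0 is never assigned to a word.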
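
To make the word index less of a black box, here is a minimal pure-Python sketch of how such an index can be built: lowercase the text, treat punctuation as separators, count the words, and number them from 1 in order of decreasing frequency. The names FILTERS, tokenize and build_word_index are hypothetical helpers written for this illustration, not part of Keras, and the filter string is assumed to match the Tokenizer defaults.

```python
from collections import Counter

# Characters treated as separators (assumed to mirror the Tokenizer defaults).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase, replace filter characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


samples = ['The cat sat on the mat.',
           'The dog ate my homework, he dog ate my homework.']
word_index = build_word_index(samples)
print(word_index)
# {'the': 1, 'dog': 2, 'ate': 3, 'my': 4, 'homework': 5,
#  'cat': 6, 'sat': 7, 'on': 8, 'mat': 9, 'he': 10}
```

Index 0 is deliberately left unused, which is why the numbering starts at 1.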

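sequences = tokenizer.texts_to_sequences(samples): converts the texts into lists of integer indices, e.g. [[1, 2], [1, 2]]. The outer list has length len(samples), and each element is the list of integer indices of the words in the corresponding string.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary'): converts the texts directly into one-hot encodings. The mode argument selects the vectorization mode. The result is an array of shape (len(samples), num_words), where a value of 1 means that the word with that index occurs in the string and 0 means that it does not.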
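
The last two steps can be sketched in plain Python as well. The sketch below assumes the texts are already tokenized and uses a hand-written word index. The functions texts_to_sequences and sequences_to_binary_matrix here are hypothetical stand-ins written for illustration, not the Keras methods themselves. It shows two details that are easy to miss: indices greater than or equal to num_words are dropped, and each row marks which words occur anywhere in the sample, so a row usually contains several ones.

```python
def texts_to_sequences(token_lists, word_index, num_words):
    """Replace each known word by its index, dropping indices >= num_words."""
    return [
        [word_index[w] for w in tokens
         if w in word_index and word_index[w] < num_words]
        for tokens in token_lists
    ]


def sequences_to_binary_matrix(sequences, num_words):
    """Build one row per sample and one column per index in [0, num_words)."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            row[idx] = 1.0  # presence only; repeated words still give 1
    return matrix


# Hypothetical, already-tokenized input and a hand-written word index.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6}
token_lists = [['the', 'cat', 'sat', 'on', 'the', 'mat'], ['the', 'dog']]
num_words = 5  # only indices 1..4 survive

sequences = texts_to_sequences(token_lists, word_index, num_words)
matrix = sequences_to_binary_matrix(sequences, num_words)
print(sequences)  # [[1, 2, 3, 4, 1], [1]]
print(matrix)     # [[0.0, 1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
```

With num_words = 5, the words 'mat' (index 5) and 'dog' (index 6) are dropped, the matrix has 5 columns, and column 0 stays all zeros because no word is ever assigned index 0.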

2. A concrete example

[Code]

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework, he dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the 100 most common words
tokenizer = Tokenizer(num_words=100)
# This builds the word index
tokenizer.fit_on_texts(samples)

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('word_index:', word_index)

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)
print('sequences:', sequences)
print('sequences.type', type(sequences))

# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
print('one_hot_results:\n', one_hot_results)
print('one_hot_results.shape:', one_hot_results.shape)
print('one_hot_results.type:', type(one_hot_results))
print('Found %s unique tokens.' % len(word_index))

[Output]