[Deep Learning] A Summary of Methods for Converting Text Data into Tensors

License: CC 4.0 BY-SA

Tags:

#deep-learning #artificial-intelligence

3. Word-level one-hot encoding with Keras

4. Word-level one-hot encoding with the hashing trick

References:

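Problem description:

Deep learning models cannot take raw text as input; they only work with numeric tensors. Text vectorization is the process of converting text into numeric tensors. There are two ways to do it: ① convert each word in the text into a vector; ② convert each character in the text into a vector.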
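To make the two options concrete, here is a minimal sketch that vectorizes the same short string once per word and once per character. The helper `one_hot`, the sample string, and the sorted vocabulary order are assumptions made for illustration only; they are not taken from the article's own code.

```python
import numpy as np


def one_hot(tokens, vocabulary):
    """Return a (len(tokens), len(vocabulary)) matrix with a single 1 per row.

    Hypothetical helper for illustration; not part of any library.
    """
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = np.zeros((len(tokens), len(vocabulary)))
    for row, token in enumerate(tokens):
        matrix[row, index[token]] = 1.0
    return matrix


text = "the cat sat"

# Option 1: one vector per word.
word_tokens = text.split()
word_vectors = one_hot(word_tokens, sorted(set(word_tokens)))

# Option 2: one vector per character (spaces count as characters here).
char_tokens = list(text)
char_vectors = one_hot(char_tokens, sorted(set(char_tokens)))

print(word_vectors.shape)  # (3, 3): 3 words, 3 distinct words
print(char_vectors.shape)  # (11, 7): 11 characters, 7 distinct characters
```

Character-level vectors need a much smaller vocabulary but produce longer sequences, while word-level vectors keep sequences short at the cost of a vocabulary that grows with the corpus.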

Overview of methods:

1. Word-level one-hot encoding

Code:

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

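# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenize with split(); punctuation stays attached to words (e.g. 'mat.')
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word; index 0 is left unused
            token_index[word] = len(token_index) + 1


# Only the first max_length words of each sample are encoded
max_length = 10

# Store the results in `results`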
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j,