Deep Learning for Text and Sequences: IMDB Movie Review Classification with Keras

Latest recommended article published on 2022-08-01 20:09:27

Original

Licensed under CC 4.0 BY-SA

Article tags:

#Deep Learning

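【Use cases】
In deep learning, text and sequence data have many applications:

Text classification and time-series classification, e.g. identifying the topic of an article or the author of a book
Time-series comparison, e.g. estimating how closely related two documents or two stock tickers are
Sequence-to-sequence learning, e.g. English-to-Chinese and Chinese-to-English translation systems
Sentiment analysis, e.g. classifying the sentiment of a microblog post, or a movie review as positive or negative
Time-series forecasting, e.g. predicting the future weather at a given location from recent weather data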

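【Working with text data】
A deep learning model does not take raw text as input. The text is first converted into numeric tensors, a process called vectorization. There are several ways to do this:

Split the text into words and convert each word into a vector;
Split the text into characters and convert each character into a vector;
Extract n-grams of words or characters and convert each n-gram into a vector.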
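The three granularities above can be sketched in plain Python. This is a minimal illustration, not a standard pipeline: lowercasing, stripping ASCII punctuation, and the helper name `build_index` are all assumptions made for the example. It only goes as far as integer ids; turning those ids into vectors is a separate encoding step.

```python
import string

text = "The cat sat on the mat."

# Assumption: lowercase the text and strip ASCII punctuation before splitting.
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))

# 1) Word-level tokens.
word_tokens = cleaned.split()

# 2) Character-level tokens (spaces are kept as tokens here).
char_tokens = list(cleaned)

# 3) Word 2-grams: every pair of adjacent words.
bigram_tokens = [" ".join(word_tokens[i:i + 2])
                 for i in range(len(word_tokens) - 1)]


def build_index(tokens):
    """Hypothetical helper: give each distinct token an integer id, starting at 1."""
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index) + 1
    return index


word_index = build_index(word_tokens)
word_ids = [word_index[token] for token in word_tokens]

print(word_tokens)
print(bigram_tokens)
print(word_ids)
```

Whichever granularity is chosen, the pattern is the same: split into tokens, map tokens to ids, then encode the ids as vectors.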

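Some terms that come up often in text processing:

token: a unit that text is split into. Words, characters, and n-grams can all serve as tokens;
tokenization: the process of splitting text into tokens;
n-grams: groups of N or fewer consecutive words extracted from a sentence. Take “The cat sat on the mat.” as an example;
its set of 2-grams is {“The”, “The cat”, “cat”, “cat sat”, “sat”, “sat on”, “on”, “on the”, “the”, “the mat”, “mat”}
and, likewise, its set of 3-grams is {“The”, “The cat”, “cat”, “cat sat”, “The cat sat”, “sat”, “sat on”, “on”, “cat sat on”, “on the”, “the”, “sat on the”, “the mat”, “mat”, “on the mat”};
bag-of-words: an unordered set of tokens produced by tokenization, as opposed to a list or sequence. The two sets above are a bag-of-2-grams and a bag-of-3-grams.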
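As a check on the n-gram definition, the short sketch below rebuilds both example sets. The function name `bag_of_ngrams` is made up for this illustration, and it assumes whitespace splitting with the trailing period removed.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all runs of 1..n consecutive words in `sentence`."""
    # Assumption: split on whitespace and drop the trailing period.
    words = sentence.rstrip(".").split()
    bag = set()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag


sentence = "The cat sat on the mat."
bag2 = bag_of_ngrams(sentence, 2)
bag3 = bag_of_ngrams(sentence, 3)

print(sorted(bag2))
print(len(bag2), len(bag3))  # 11 15
```

Because the result is a set, word order within the sentence is lost, which is exactly what "bag" means here. "The" and "the" are counted separately because the sketch does not lowercase.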

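Tips:

Extracting n-grams is really a form of feature engineering on a sentence. Deep learning replaces this manual step with 1D convolutions, RNNs, and similar layers that learn such features themselves;

Although more and more complex tasks are moving to deep learning, n-grams remain hard to avoid for lightweight tasks, where they pair well with traditional, efficient methods such as logistic regression and random forests.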
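To make the first tip concrete, here is a small NumPy sketch of why a 1D convolution can stand in for n-gram extraction: a filter of width 2 sliding over one-hot word vectors acts as a detector for one particular 2-gram. The filter weights are set by hand purely for illustration; in a real network they would be learned, and a framework layer would be used instead of this explicit loop.

```python
import numpy as np

sentence = "the cat sat on the mat".split()
vocab = {word: i for i, word in enumerate(sorted(set(sentence)))}

# One-hot encode the sentence: shape (sequence_length, vocab_size).
one_hot = np.zeros((len(sentence), len(vocab)))
for position, word in enumerate(sentence):
    one_hot[position, vocab[word]] = 1.0

# Hand-written filter of width 2 that responds to "cat" followed by "sat".
# Assumption for illustration only: real filters are learned during training.
kernel = np.zeros((2, len(vocab)))
kernel[0, vocab["cat"]] = 1.0
kernel[1, vocab["sat"]] = 1.0

# Slide the filter over every window of two adjacent words.
activations = np.array([
    np.sum(one_hot[start:start + 2] * kernel)
    for start in range(len(sentence) - 1)
])

print(activations)  # [0. 2. 0. 0. 0.]
```

The activation peaks only at the window "cat sat", so this single filter behaves like a 2-gram feature. A convolutional layer learns many such filters at once.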

【Two ways to encode tokens】

one-hot
One-hot encoding is a common way to encode tokens. A few toy examples show how it works.

Word-level one-hot encoding

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build the vocabulary: give each distinct word in the samples an index, starting at 1.
# Result (10 entries): {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are encoded (at most 10 words per sentence).
max_length = 10
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))

# results has shape (2, 10, 11): (number of samples, max_length, largest index + 1)
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
print(results)

The cat sat on the mat.
The dog ate my homework.
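One way to sanity-check the word-level encoding is to decode the tensor back into words. The sketch below wraps the same logic in a function and adds a decoder. The names `one_hot_words` and `decode` are hypothetical helpers introduced for this example, not part of any library.

```python
import numpy as np


def one_hot_words(samples, max_length=10):
    """Word-level one-hot encoding, following the same steps as the code above."""
    token_index = {}
    for sample in samples:
        for word in sample.split():
            token_index.setdefault(word, len(token_index) + 1)
    results = np.zeros((len(samples), max_length, len(token_index) + 1))
    for i, sample in enumerate(samples):
        for j, word in enumerate(sample.split()[:max_length]):
            results[i, j, token_index[word]] = 1.0
    return results, token_index


def decode(encoded_sample, token_index):
    """Map each non-empty one-hot row back to its word."""
    reverse_index = {index: word for word, index in token_index.items()}
    words = []
    for row in encoded_sample:
        if row.any():  # all-zero rows are padding past the end of the sentence
            words.append(reverse_index[int(row.argmax())])
    return " ".join(words)


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
results, token_index = one_hot_words(samples)

print(results.shape)                     # (2, 10, 11)
print(decode(results[0], token_index))   # The cat sat on the mat.
```

Index 0 is never assigned to a word, so column 0 of the tensor stays empty, and positions beyond the end of a sentence are left as all-zero rows.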

Character-level one-hot encoding

import numpy as np
import string
samples = ['The cat sa
