Word Tokenization
【Example 1】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
【Output】
{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
Note: the sentences contain both 'i' and 'I', but the encoding only has 'i', because the Tokenizer lowercases text by default.
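If you do want case preserved, the Tokenizer has a lower argument that defaults to True. A minimal sketch of turning it off (an extra illustration, not part of the original example):
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
]
# lower=False disables the default lowercasing, so 'i' and 'I' stay separate entries
tokenizer = Tokenizer(num_words=100, lower=False)
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # both 'i' and 'I' should now appear in the vocabulary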
【Example 2】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
【Output】
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
Analysis: compared with Example 1 there is one more sentence, but only one new word, 'you', so the word index gains just the extra entry {'you': 6}.
Note: the new sentence ends with '!', but the '!' is not encoded.
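Punctuation disappears because of the Tokenizer's filters argument, which strips most punctuation characters by default. A small sketch of overriding it (an extra illustration, not part of the original example):
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ['You love my dog!']
# filters='' turns punctuation stripping off, so 'dog!' is kept as its own token
tokenizer = Tokenizer(num_words=100, filters='')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # 'dog!' (including the '!') now appears in the vocabulary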
Text Sequencing
Word tokenization produces a vocabulary, which you can think of as a dictionary in which each word is encoded as a number. On top of this, text can be converted into sequences of numbers.
【Example 3】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)
【Output】
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
Analysis: 1. every word that appears in the sentences is added to the vocabulary;
2. the four sentences are converted into four number sequences.
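To sanity-check a sequence you can map it back into words with sequences_to_texts. A one-line sketch, assuming the tokenizer and sequences from Example 3 are still in scope:
# Decode the first sequence back into words to verify the mapping
print(tokenizer.sequences_to_texts([sequences[0]]))  # expected: ['i love my dog']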
【Example 4】Sequencing new sentences (containing new words)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
test_data = [
"i really love my dog",
"my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
【Output】
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [1, 3, 1]]
Analysis:
'i really love my dog' ==> [4, 2, 1, 3]
The word 'really' is not in the vocabulary, so it is simply dropped.
'my dog loves my manatee' ==> [1, 3, 1]
The words 'loves' and 'manatee' are not in the vocabulary either, so they are dropped as well.
Silently discarding words from the original sentences is not ideal; there should be a better way to handle this:
【Example 5】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
test_data = [
"i really love my dog",
"my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
【Output】
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
Analysis: an oov_token has been added. OOV is short for "out of vocabulary": any word that is not in the vocabulary is replaced by <OOV>. The token string itself can be anything you like, as long as it does not clash with an existing word.
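A related detail: num_words does not shrink word_index itself; it only limits which indexes texts_to_sequences will emit (the most frequent words, with everything else mapped to the OOV token). A small sketch to illustrate, reusing the sentences above (small_tokenizer is just an illustrative name):
# With num_words=4 only indexes 1..3 survive sequencing; everything else maps to <OOV> (1)
small_tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
small_tokenizer.fit_on_texts(sentences)
print(small_tokenizer.word_index)                             # still lists every word
print(small_tokenizer.texts_to_sequences(['i love my dog']))  # expected roughly [[1, 3, 2, 1]]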
Padding
To make training easier, the sentences in the training data need to be brought to a uniform length, so the shorter ones are padded.
【Example 6】
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
print(word_index)
print(sequences)
print(padded)
【Output】
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0 0 0 5 3 2 4]
[ 0 0 0 5 3 2 7]
[ 0 0 0 6 3 2 4]
[ 8 6 9 2 4 10 11]]
Analysis:
'i love my dog' ==> [5, 3, 2, 4] ==> [ 0 0 0 5 3 2 4]
From this you can see:
- padding is added at the front by default
- the padded length equals the length of the longest sentence
You can also pad at the end instead, by setting the padding parameter of pad_sequences:
padded = pad_sequences(sequences, padding='post')
You can also specify a maximum length for the encoded sentences:
padded = pad_sequences(sequences, padding='post', maxlen=5)
But this raises a question: if a sentence is longer than the specified length, information will inevitably be lost, and the key question is which end gets cut. By default sequences are truncated from the front, but you can control this with the truncating parameter:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
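Putting the three options together on the sequences from Example 6 (the expected values in the comments are worked out by hand from those sequences, so verify them when you run the code):
# Pad at the end, truncate at the end, and cap every sequence at 5 tokens
padded_post = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded_post)
# Expected roughly:
# [[5 3 2 4 0]
#  [5 3 2 7 0]
#  [6 3 2 4 0]
#  [8 6 9 2 4]]   <- the 7-word sentence loses its last two tokens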
【A slightly more complex example】
import json
with open("D:/tmp/sarcasm_t.json", 'r') as f:
    datastore = json.load(f)
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(sentences[2])
print(padded[2])
print(padded.shape)
【Output】
[ 4 8435 3338 2746 22 2 166 8436 416 3113 6 258 9 1002
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
(26710, 40)
Exploring the BBC News Dataset
【Example 7】Tokenize the BBC news categories and article text separately
'''
!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
-O /tmp/bbc-text.csv
'''
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but",
"by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
"here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more",
"most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should",
"so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've",
"this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's",
"which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]
sentences = []
labels = []
with open("bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
        # collapse double spaces left behind after removing stopwords
        sentence = sentence.replace("  ", " ")
        sentences.append(sentence)
print(len(sentences))
print(sentences[0])
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)
print(label_seq)
print(label_word_index)
Analysis:
- The BBC news articles are downloaded from https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv;
- Each article belongs to one of five categories: tech, business, sport, entertainment, politics;
- In bbc-text.csv each line is one news item in the format <label>,<text>, where label is the category from the previous point;
- This example tokenizes the news labels and the article text separately;
- Common stopwords are stripped from the text before tokenizing, so they are never encoded.
This example only performs tokenization and no further processing; tokenization prepares the data for the training that follows.
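As a small follow-up (not part of the original example), the label sequences can be turned into a NumPy array for training. Note that the label indexes produced by the Tokenizer start at 1, not 0, which matters when you choose the size of the model's output layer; the sketch below just shows the conversion:
import numpy as np
# label_seq is a list like [[4], [2], ...]; flatten it into a 1-D array of integer labels
label_array = np.array(label_seq).flatten()
print(label_array.shape)  # one integer label per article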