【TF2.0-NLP】Hello World (word tokenization)

This article walks through text preprocessing with TensorFlow: word tokenization, converting text to integer sequences, and sequence padding. It shows how Tokenizer and pad_sequences standardize text data as a foundation for subsequent machine learning and deep learning tasks.


Word tokenization

【Example 1】

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

【Output】

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

Note: the sentences contain both 'i' and 'I', but only 'i' appears in the word index.

【Example 2】

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

【Output】

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Analysis: compared with Example 1 there is one more sentence, but it adds only one new word, 'you', so the word index gains just the entry {'you': 6}.

Note: the new sentence ends with '!', but the '!' is not encoded.
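
Both notes above come from Tokenizer's default arguments: lower=True lowercases everything, and filters strips common punctuation (including '!') before the text is split on spaces. A minimal sketch making those defaults explicit (the values shown are, to my understanding, the Keras defaults; override them if you need case-sensitive or punctuation-aware tokens):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(
    num_words=100,
    lower=True,                                       # 'I' and 'i' map to the same token
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',   # punctuation such as '!' is stripped
    split=' '                                         # sentences are split on spaces
)
tokenizer.fit_on_texts(sentences)   # sentences as defined in Example 2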
 

Converting text to sequences

The result of word tokenization is a vocabulary (a word index), which you can think of as a dictionary that maps each word to an integer. On top of this, each sentence can be converted into a sequence of integers.

【Example 3】

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

【Output】

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Analysis:
    1. Every word that appears in any of the sentences is assigned an index;

    2. Each of the 4 sentences is converted into its own sequence of integers, giving 4 sequences.
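
To go back the other way, the word index can be inverted to decode a sequence into words again (a small sketch reusing word_index and sequences from Example 3):

# Invert the word index: {1: 'my', 2: 'love', 3: 'dog', ...}
reverse_word_index = {index: word for word, index in word_index.items()}

# Decode the first sequence back into text
print(' '.join(reverse_word_index[i] for i in sequences[0]))   # i love my dog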

【Example 4】Converting new sentences (containing unseen words) into sequences

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(word_index)

test_data = [
    "i really love my dog",
    "my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

【Output】

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [1, 3, 1]]

Analysis:

 'i really love my dog' ==> [4, 2, 1, 3]

The word 'really' is not in the vocabulary, so it is simply dropped.

'my dog loves my manatee' ==> [1, 3, 1]

The words 'loves' and 'manatee' are not in the vocabulary either, so they are dropped as well.

Silently dropping words from the original sentences is not ideal; there should be a better way to handle this:

【Example 5】


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index


print(word_index)

test_data = [
    "i really love my dog",
    "my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

【Output】

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Analysis: an oov_token entry was added to the vocabulary. OOV stands for "out of vocabulary": any word not in the vocabulary is replaced by <OOV>. The token string '<OOV>' can be anything you like, as long as it does not clash with an existing word.
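
A quick check (reusing the tokenizer from Example 5) confirms that the OOV token is simply the first entry in the vocabulary, which is why every unknown word is encoded as 1 above:

print(tokenizer.word_index['<OOV>'])   # 1 -- the index assigned to every out-of-vocabulary word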

Padding

To make training convenient, the sentences in the training data need to be brought to a uniform length, which means padding the shorter ones.

【Example 6】


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)

【Output】

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]

Analysis:

    'i love my dog' ==> [5, 3, 2, 4] ==> [ 0  0  0  5  3  2  4]

From this you can see:

  •     padding is added at the front by default
  •     the padded length equals the length of the longest sentence

Padding can also be added at the end instead, using the padding parameter of pad_sequences:

padded = pad_sequences(sequences, padding='post')

You can also specify a maximum length for the padded sequences:

padded = pad_sequences(sequences, padding='post', maxlen=5)

But this raises a question: if a sentence is longer than the specified maxlen, information will inevitably be lost, and the key question is from which end. By default tokens are dropped from the front, but you can control this with the truncating parameter:

padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
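
For example, applied to the sequences from Example 6, this combination pads the three short sentences at the end and keeps only the first 5 tokens of the long one. A quick sketch (reusing sequences and the pad_sequences import from Example 6):

padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded)

which should print:

[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 8  6  9  2  4]]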

【A slightly more involved example】

   First download the sarcasm_t.json file.


import json

with open("D:/tmp/sarcasm_t.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(sentences[2])
print(padded[2])
print(padded.shape)

【Output】

[   4 8435 3338 2746   22    2  166 8436  416 3113    6  258    9 1002
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]
(26710, 40)
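
Here every headline is padded out to 40 tokens, the length of the longest one. If that is longer than you want, maxlen caps it; maxlen=16 below is an arbitrary value chosen only to show the effect on the shape:

padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=16)
print(padded.shape)   # (26710, 16)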

Exploring the BBC News dataset

【Example 7】Tokenize the BBC news labels and article bodies separately

'''
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O /tmp/bbc-text.csv

'''

import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but",
             "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
             "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more",
             "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should",
             "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've",
             "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's",
             "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]


sentences = []
labels = []
with open("bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")
        sentences.append(sentence)


print(len(sentences))
print(sentences[0])


tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)
print(label_seq)
print(label_word_index)

Analysis:

  1.  The BBC news data is downloaded from https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv;
  2. Each news item is labelled with one of five categories: tech, business, sport, entertainment, politics;
  3. In bbc-text.csv each line is one article in the format <label>,<body>, where label is one of the categories above;
  4. This example tokenizes the news labels and the article bodies separately;
  5. Stopwords are removed from the bodies beforehand, so they are never encoded;

This example only performs tokenization, with no further processing; the tokenized, padded data is preparation for the training steps that come later.
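
As a rough sketch of that preparation (not part of the original example), the padded matrix and label sequences can be turned into NumPy arrays before being handed to a model; note that the label indices produced by label_tokenizer start at 1, not 0:

import numpy as np

training_padded = np.array(padded)      # shape: (number of articles, length of the longest article)
training_labels = np.array(label_seq)   # each entry is a one-element list such as [4]; indices start at 1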
