Word Tokenization
【Example 1】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
【Output】
{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
Note: the sentences contain both 'i' and 'I', but the encoding only has 'i', because the Tokenizer lowercases text by default.
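If you do want case preserved, the Tokenizer has a lower argument that defaults to True. A minimal sketch of turning it off (an extra illustration, not part of the original example):
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
]
# lower=False disables the default lowercasing, so 'i' and 'I' stay separate entries
tokenizer = Tokenizer(num_words=100, lower=False)
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # both 'i' and 'I' should now appear in the vocabulary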
【Example 2】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
【Output】
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
Analysis: compared with Example 1 there is one more sentence, but only one new word, 'you', so the word index gains just the extra entry {'you': 6}.
Note: the new sentence ends with '!', but the '!' is not encoded.
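Punctuation disappears because of the Tokenizer's filters argument, which strips most punctuation characters by default. A small sketch of overriding it (an extra illustration, not part of the original example):
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ['You love my dog!']
# filters='' turns punctuation stripping off, so 'dog!' is kept as its own token
tokenizer = Tokenizer(num_words=100, filters='')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # 'dog!' (including the '!') now appears in the vocabulary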
Text Sequencing
Word tokenization produces a vocabulary, which you can think of as a dictionary in which each word is encoded as a number. On top of this, text can be converted into sequences of numbers.
【Example 3】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)
【Output】
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
Analysis: 1. every word that appears in the sentences is added to the vocabulary;
2. the four sentences are converted into four number sequences.
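To sanity-check a sequence you can map it back into words with sequences_to_texts. A one-line sketch, assuming the tokenizer and sequences from Example 3 are still in scope:
# Decode the first sequence back into words to verify the mapping
print(tokenizer.sequences_to_texts([sequences[0]]))  # expected: ['i love my dog']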
【Example 4】Sequencing new sentences (containing new words)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
test_data = [
"i really love my dog",
"my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
【Output】
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [1, 3, 1]]
Analysis:
'i really love my dog' ==> [4, 2, 1, 3]
The word 'really' is not in the vocabulary, so it is simply dropped.
'my dog loves my manatee' ==> [1, 3, 1]
The words 'loves' and 'manatee' are not in the vocabulary either, so they are dropped as well.
Silently discarding words from the original sentences is not ideal; there should be a better way to handle this:
【Example 5】
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
test_data = [
"i really love my dog",
"my dog loves my manatee"
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
【Output】
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
Analysis: an oov_token has been added. OOV is short for "out of vocabulary": any word that is not in the vocabulary is replaced by <OOV>. The token string itself can be anything you like, as long as it does not clash with an existing word.
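A related detail: num_words does not shrink word_index itself; it only limits which indexes texts_to_sequences will emit (the most frequent words, with everything else mapped to the OOV token). A small sketch to illustrate, reusing the sentences above (small_tokenizer is just an illustrative name):
# With num_words=4 only indexes 1..3 survive sequencing; everything else maps to <OOV> (1)
small_tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
small_tokenizer.fit_on_texts(sentences)
print(small_tokenizer.word_index)                             # still lists every word
print(small_tokenizer.texts_to_sequences(['i love my dog']))  # expected roughly [[1, 3, 2, 1]]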
Padding
To make training easier, the sentences in the training data need to be brought to a uniform length, so the shorter ones are padded.
【Example 6】
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
print(word_index)
print(sequences)
print(padded)
【Output】
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0 0 0 5 3 2 4]
[ 0 0 0 5 3 2 7]
[ 0 0 0 6 3 2 4]
[ 8 6 9 2 4 10 11]]
Analysis:
'i love my dog' ==> [5, 3, 2, 4] ==> [ 0 0 0 5 3 2 4]
From this you can see:
- padding is added at the front by default
- the padded length equals the length of the longest sentence
You can also pad at the end instead, by setting the padding parameter of pad_sequences:
padded = pad_sequences(sequences, padding='post')
You can also specify a maximum length for the encoded sentences:
padded = pad_sequences(sequences, padding='post', maxlen=5)
But this raises a question: if a sentence is longer than the specified length, information will inevitably be lost, and the key question is which end gets cut. By default sequences are truncated from the front, but you can control this with the truncating parameter:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
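Putting the three options together on the sequences from Example 6 (the expected values in the comments are worked out by hand from those sequences, so verify them when you run the code):
# Pad at the end, truncate at the end, and cap every sequence at 5 tokens
padded_post = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded_post)
# Expected roughly:
# [[5 3 2 4 0]
#  [5 3 2 7 0]
#  [6 3 2 4 0]
#  [8 6 9 2 4]]   <- the 7-word sentence loses its last two tokens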
【A slightly more complex example】
import json
with open("D:/tmp/sarcasm_t.json", 'r') as f:
    datastore = json.load(f)
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(sentences[2])
print(padded[2])
print(padded.shape)
【Output】
[ 4 8435 3338 2746 22 2 166 8436 416 3113 6 258 9 1002
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
(26710, 40)
Exploring the BBC News Dataset
【Example 7】Tokenize the BBC news categories and article text separately
'''
!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
-O /tmp/bbc-text.csv
'''
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but",
"by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
"here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more",
"most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should",
"so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've",
"this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's",
"which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]
sentences = []
labels = []
with open("bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
        # collapse double spaces left behind after removing stopwords
        sentence = sentence.replace("  ", " ")
        sentences.append(sentence)
print(len(sentences))
print(sentences[0])
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)
print(label_seq)
print(label_word_index)
Analysis:
- The BBC news articles are downloaded from https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv;
- Each article belongs to one of five categories: tech, business, sport, entertainment, politics;
- In bbc-text.csv each line is one news item in the format <label>,<text>, where label is the category from the previous point;
- This example tokenizes the news labels and the article text separately;
- Common stopwords are stripped from the text before tokenizing, so they are never encoded.
This example only performs tokenization and no further processing; tokenization prepares the data for the training that follows.
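As a small follow-up (not part of the original example), the label sequences can be turned into a NumPy array for training. Note that the label indexes produced by the Tokenizer start at 1, not 0, which matters when you choose the size of the model's output layer; the sketch below just shows the conversion:
import numpy as np
# label_seq is a list like [[4], [2], ...]; flatten it into a 1-D array of integer labels
label_array = np.array(label_seq).flatten()
print(label_array.shape)  # one integer label per article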