This file is a text-classification example. The dataset is the Reuters newswire corpus, and we can decode its contents to take a look:
from keras.datasets import reuters

# load as in reuters_mlp.py: keep only the top 1000 words,
# rarer words are replaced by the out-of-vocabulary index 2
max_words = 1000
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2)

word_index = reuters.get_word_index()
# shift every word id by +3 because indices 0-3 are reserved (see below)
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["?"] = 2  # unknown / out-of-vocabulary, printed as '?'
word_index["<UNUSED>"] = 3
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_review(text):
    return ' '.join(reverse_word_index.get(i, '?') for i in text)

for i in range(10):
    print(decode_review(x_train[i]))
    print(y_train[i])
which prints news text like this:
<START> ? ? said as a result of its december acquisition of ? co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and ? ? revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash ? per share this year should be 2 50 to three dlrs reuter 3
The (v + 3) shift above is needed because indices 0, 1, 2 and 3 are reserved, standing for "padding", "start of sequence", "unknown" and "unused" respectively; the trailing "3" on the sample line is simply the class label emitted by print(y_train[i]), not part of the text.
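As a quick sanity check (a hypothetical probe, not part of the original script), you can compare a word's raw index with its shifted one:

print(reuters.get_word_index()['the'])  # raw index from the dataset
print(word_index['the'])                # the same index shifted by +3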
The script then encodes the sequences with Tokenizer. We add a few print statements to see the relationship between the data before and after encoding:
from keras.preprocessing.text import Tokenizer

# use sorted() to get a copy; tmp_list.sort() would mutate x_train[0] in place
tmp_list = sorted(x_train[0])
print(tmp_list)

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])
The printed output:
[1, 2, 2, 2, 2, 2, 2, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10, 11, 11, 11, 11, 12, 15, 15, 15, 15, 15, 15, 16, 16, 17, 19, 19, 22, 22, 22, 25, 26, 29, 30, 32, 39, 43, 44, 48, 48, 49, 52, 67, 67, 67, 83, 84, 89, 90, 90, 90, 102, 109, 111, 124, 132, 134, 151, 154, 155, 186, 197, 207, 209, 209, 258, 270, 272, 369, 447, 482, 504, 864]
Vectorizing sequence data...
[0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0.
0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0.
1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
(1000 entries in total; the remaining rows are almost all zeros and are elided here)
In other words, if a word index occurs in the text, the corresponding position is set to 1, otherwise 0. (Note that every index in the sorted list above is already below 1000, because load_data(num_words=1000) replaced rarer words with the out-of-vocabulary index 2.) The drawback is that this encoding discards word-order information: "man bites dog" and "dog bites man" become indistinguishable. The upside is space: compared with an Embedding encoding, it is at least an order of magnitude smaller.
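To make the order insensitivity concrete, here is a minimal standalone sketch (the toy sequences are made up) showing that two sequences with the same tokens in different order produce identical binary rows:

from keras.preprocessing.text import Tokenizer
import numpy as np

tokenizer = Tokenizer(num_words=10)
a = [[3, 5, 7]]  # e.g. "man bites dog"
b = [[7, 5, 3]]  # e.g. "dog bites man" -- same tokens, reversed
va = tokenizer.sequences_to_matrix(a, mode='binary')
vb = tokenizer.sequences_to_matrix(b, mode='binary')
print(np.array_equal(va, vb))  # True: word order (and repetition) is lost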
The shapes after encoding:
x_train shape: (8982, 1000)
y_train shape: (8982, 46)
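The second dimension of 46 appears because the integer labels are one-hot encoded; reuters_mlp.py does this with keras.utils.to_categorical:

import numpy as np
import keras

num_classes = np.max(y_train) + 1  # 46 topic classes
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)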
That is, there are 46 topic classes, which are classified by a neural network:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 512512
_________________________________________________________________
activation_1 (Activation) (None, 512) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 46) 23598
_________________________________________________________________
activation_2 (Activation) (None, 46) 0
=================================================================
Total params: 536,110
Trainable params: 536,110
Non-trainable params: 0
_________________________________________________________________
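For reference, here is a minimal sketch of the model behind this summary (layer sizes read off the summary above; the 0.5 dropout rate is the value used in reuters_mlp.py):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))  # 1000*512 + 512 = 512,512 params
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))                    # 512*46 + 46 = 23,598 params
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()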
The script reuters_mlp_relu_vs_selu.py uses the same dataset. Its purpose is to compare two networks and see which performs better: they differ in activation function (relu vs. selu), in dropout layer (Dropout vs. AlphaDropout), and slightly in how the weights are initialized. Here we only print the two networks:
network1
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 16) 16016
_________________________________________________________________
activation_1 (Activation) (None, 16) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 16) 0
_________________________________________________________________
dense_2 (Dense) (None, 16) 272
_________________________________________________________________
activation_2 (Activation) (None, 16) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 16) 0
_________________________________________________________________
dense_3 (Dense) (None, 16) 272
_________________________________________________________________
activation_3 (Activation) (None, 16) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_4 (Dense) (None, 16) 272
_________________________________________________________________
activation_4 (Activation) (None, 16) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 16) 0
_________________________________________________________________
dense_5 (Dense) (None, 16) 272
_________________________________________________________________
activation_5 (Activation) (None, 16) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 16) 0
_________________________________________________________________
dense_6 (Dense) (None, 16) 272
_________________________________________________________________
activation_6 (Activation) (None, 16) 0
_________________________________________________________________
dropout_6 (Dropout) (None, 16) 0
_________________________________________________________________
dense_7 (Dense) (None, 46) 782
_________________________________________________________________
activation_7 (Activation) (None, 46) 0
=================================================================
Total params: 18,158
Trainable params: 18,158
Non-trainable params: 0
_________________________________________________________________
network2
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 16) 16016
________________________________________________________________________________
activation_1 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_1 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_2 (Dense) (None, 16) 272
________________________________________________________________________________
activation_2 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_2 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_3 (Dense) (None, 16) 272
________________________________________________________________________________
activation_3 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_3 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_4 (Dense) (None, 16) 272
________________________________________________________________________________
activation_4 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_4 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_5 (Dense) (None, 16) 272
________________________________________________________________________________
activation_5 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_5 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_6 (Dense) (None, 16) 272
________________________________________________________________________________
activation_6 (Activation) (None, 16) 0
________________________________________________________________________________
alpha_dropout_6 (AlphaDropout) (None, 16) 0
________________________________________________________________________________
dense_7 (Dense) (None, 46) 782
________________________________________________________________________________
activation_7 (Activation) (None, 46) 0
================================================================================
Total params: 18,158
Trainable params: 18,158
Non-trainable params: 0
________________________________________________________________________________
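Both summaries share the same topology; only the activation, the dropout layer, and the initializer differ. Below is a sketch of a builder that reproduces them. The dropout rates and the glorot_uniform/lecun_normal initializers are assumptions in line with SELU conventions, so check the script for the exact hyperparameters:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, AlphaDropout

def build_net(activation, dropout_cls, dropout_rate, kernel_initializer):
    """Six Dense(16) blocks followed by a 46-way softmax, as printed above."""
    model = Sequential()
    model.add(Dense(16, input_shape=(max_words,), kernel_initializer=kernel_initializer))
    model.add(Activation(activation))
    model.add(dropout_cls(dropout_rate))
    for _ in range(5):
        model.add(Dense(16, kernel_initializer=kernel_initializer))
        model.add(Activation(activation))
        model.add(dropout_cls(dropout_rate))
    model.add(Dense(46))
    model.add(Activation('softmax'))
    return model

network1 = build_net('relu', Dropout, 0.5, 'glorot_uniform')     # plain relu + Dropout
network2 = build_net('selu', AlphaDropout, 0.1, 'lecun_normal')  # self-normalizing selu + AlphaDropout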