Let me organize my thoughts. My research topic is word sense disambiguation (WSD). I read a paper by the folks at Google that uses a neural network for WSD and tried to reproduce it in Keras, but no matter what I did the results were poor. After talking it over with a senior labmate, he seriously advised me to reproduce the authors' experiments in TensorFlow. So I started learning TF. Over the break I watched Mofan's tutorial videos at home, but they only cover the very basics, and reading code on GitHub is still a struggle: unlike Keras, whose Chinese documentation is quite good, TF's various APIs have no Chinese usage docs.
So what now? I contacted the paper's authors to ask for the source code and was told it would be open-sourced soon. So I'm changing strategy: instead of rushing to write the code for my experiments, I'll focus on properly learning TF. (If my advisor keeps pushing, so be it; rushing leads to nothing.) Learning something like this requires a clear head, so my plan now is to read one piece of code from beginning to end and understand what every step does. Reading the English documentation directly is too overwhelming, and there's no way to memorize that many APIs all at once, so this is the approach I've chosen.
Let's start from the code in this tutorial:
http://www.tensorfly.cn/tfdoc/tutorials/recurrent.html
https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
Start with if __name__ == '__main__': I never knew what this line was for before. It means: the code block under it runs only when the module is executed directly; if the module is imported, that block does not run.
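To make this concrete, here is a tiny example I made up myself (demo.py is not part of the tutorial code):
# demo.py
def main():
    print("running as a script")

if __name__ == "__main__":  # True only when the file is run directly, e.g. python demo.py
    main()                   # not executed when another module does "import demo"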
Now on to the actual code, walking through it step by step in the order the program runs. Brief explanations go as comments to the right of the code; for more complicated functions I'll add an explanation below the code.
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Example / benchmark for building a PTB LSTM model.
Trains the model described in:
(Zaremba, et. al.) Recurrent Neural Network Regularization # trains the RNN described in this paper
http://arxiv.org/abs/1409.2329
There are 3 supported model configurations:
===========================================
| config | epochs | train | valid | test
===========================================
| small | 13 | 37.99 | 121.39 | 115.91
| medium | 39 | 48.45 | 86.16 | 82.07
| large | 55 | 37.87 | 82.62 | 78.29
The exact results may vary depending on the random initialization. # actual results may vary with the random initialization
The hyperparameters used in the model: # the parameters used in the model
- init_scale - the initial scale of the weights
- learning_rate - the initial value of the learning rate
- max_grad_norm - the maximum permissible norm of the gradient # the maximum permitted norm of the gradient (I didn't quite get this at first; it is used to clip gradients so they don't blow up)
- num_layers - the number of LSTM layers
- num_steps - the number of unrolled steps of LSTM # this is the time_step, i.e. the number of words fed in
- hidden_size - the number of LSTM units
- max_epoch - the number of epochs trained with the initial learning rate # the initial learning rate
- max_max_epoch - the total number of epochs for training
- keep_prob - the probability of keeping weights in the dropout layer # keep_prob = 1 - dropout rate
- lr_decay - the decay of the learning rate for each epoch after "max_epoch" # learning rate decay
- batch_size - the batch size
The data required for this example is in the data/ dir of the
PTB dataset from Tomas Mikolov's webpage:
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
To run:
$ python ptb_word_lm.py --data_path=simple-examples/data/
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import tensorflow as tf
import reader
# Before building and training the model, we first need to set some parameters. In TF, tf.flags can be used for global parameter settings.
flags = tf.flags
logging = tf.logging
flags.DEFINE_string(
"model", "small",
"A type of model. Possible options are: small, medium, large.") # defines the flag "model" with default value "small"; the string after it is the help text
flags.DEFINE_string("data_path", None,
"Where the training/test data is stored.") # where the downloaded data is stored
flags.DEFINE_string("save_path", None,
"Model output directory.") # where the model output is written
flags.DEFINE_bool("use_fp16", False,
"Train using 16-bit floats instead of 32bit floats") # whether to use the float16 format
FLAGS = flags.FLAGS # the value of the model flag can then be accessed as FLAGS.model
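A practical note on how these flags get used: any of them can be overridden on the command line when launching the script, for example (the --model=medium value below is just my own illustration, not from the tutorial):
$ python ptb_word_lm.py --data_path=simple-examples/data/ --model=medium
Inside the program, FLAGS.model then evaluates to "medium" instead of the default "small".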
There's a fairly good blog post on this; I'll read it first before continuing: http://www.cnblogs.com/wuzhitj/p/6297992.html
These are the hyperparameters of the small configuration (in ptb_word_lm.py they are grouped into the small model's config class):
init_scale = 0.1     # parameters are initialized from a uniform distribution over [-init_scale, +init_scale]
learning_rate = 1.0  # learning rate; it starts to decay once the number of epochs exceeds max_epoch
max_grad_norm = 5    # controls gradient explosion: if the L2 norm of the gradient vector exceeds max_grad_norm, it is scaled down proportionally
num_layers = 2       # number of LSTM layers
num_steps = 20       # length of the sequence in a single sample (the number of time steps)
hidden_size = 200    # number of units in the hidden layer
max_epoch = 4        # while epoch < max_epoch, lr_decay = 1; after max_epoch, lr_decay gradually shrinks
max_max_epoch = 13   # total number of passes over the whole text
keep_prob = 1.0      # for dropout: on each batch every unit is dropped with probability 1 - keep_prob, which helps prevent overfitting
lr_decay = 0.5       # learning rate decay
batch_size = 20      # size of each batch: 20 samples per batch
vocab_size = 10000   # vocabulary size: 10K words in total
if __name__ == "__main__":
tf.app.run()
Under if __name__ == "__main__" there is only this one line; see this blog post: http://blog.youkuaiyun.com/helei001/article/details/51859423. Honestly I don't fully understand it yet either; let's keep reading.
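What I do know is that tf.app.run() parses the command-line flags and then calls the main() function defined in this module. A rough sketch of what it does, written by me as an approximation (not the actual TensorFlow source):
import sys

def run(main=None, argv=None):
    # 1. parse command-line arguments such as --model=medium into FLAGS
    # 2. default to the main() function of the module that was run directly
    main = main or sys.modules['__main__'].main
    # 3. call it and exit with its return value
    sys.exit(main(sys.argv))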
The main function:
(To read the code documentation you have to go to the English official site, which requires a way around the firewall.)
def main(_):
if not FLAGS.data_path: # raise an error if FLAGS.data_path is None
raise ValueError("Must set --data_path to PTB data directory")
raw_data = reader.ptb_raw_data(FLAGS.data_path)
Its source is as follows:
def ptb_raw_data(data_path=None):
"""Load PTB raw data from data directory "data_path".
Reads PTB text files, converts strings to integer ids,
and performs mini-batching of the inputs.
The PTB dataset comes from Tomas Mikolov's webpage:
http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Args:
data_path: string path to the directory where simple-examples.tgz has
been extracted.
Returns:
tuple (train_data, valid_data, test_data, vocabulary)
where each of the data objects can be passed to PTBIterator.
"""
#the path and filename of each data file
train_path = os.path.join(data_path, "ptb.train.txt")
valid_path = os.path.join(data_path, "ptb.valid.txt")
test_path = os.path.join(data_path, "ptb.test.txt")
word_to_id = _build_vocab(train_path)
def _build_vocab(filename):
data = _read_words(filename) # replaces every newline in the text with <eos>, then split(); returns a list of all the words in order
# e.g. "I have a pen . \n" becomes ['I', 'have', 'a', 'pen', '.', '<eos>']
# the source of _read_words, for reference:
def _read_words(filename):
with tf.gfile.GFile(filename, "r") as f:
return f.read().decode("utf-8").replace("\n", "<eos>").split()
# ...continuing _build_vocab:
counter = collections.Counter(data) # counts the words; returns something like Counter({'N': 2, '<eos>': 2}), ordered from most to least frequent
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0])) # converts this into a list of tuples, e.g. [('<eos>', 2), ('N', 2), ...], sorted by count from high to low; words with the same count are sorted alphabetically from low to high
words, _ = list(zip(*count_pairs)) # returns two tuples, one holding all the words and one holding all the counts, in one-to-one correspondence
# e.g. ('<eos>', 'N') and (2, 2)
# see http://www.cnblogs.com/frydsh/archive/2012/07/10/2585370.html for an explanation of zip
word_to_id = dict(zip(words, range(len(words)))) # returns a dict, e.g. {'<eos>': 0, 'N': 1}; the more frequent a word, the smaller its id
return word_to_id
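To make the steps in _build_vocab concrete, here is a tiny standalone example I put together myself (the toy word list is made up, not from the PTB data):
import collections

data = ['i', 'have', 'a', 'pen', '.', '<eos>', 'i', 'have', 'a', 'dog', '.', '<eos>']
counter = collections.Counter(data)
# Counter({'i': 2, 'have': 2, 'a': 2, '.': 2, '<eos>': 2, 'pen': 1, 'dog': 1})  (display order may vary)
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
# [('.', 2), ('<eos>', 2), ('a', 2), ('have', 2), ('i', 2), ('dog', 1), ('pen', 1)]
words, _ = list(zip(*count_pairs))
# ('.', '<eos>', 'a', 'have', 'i', 'dog', 'pen')
word_to_id = dict(zip(words, range(len(words))))
# {'.': 0, '<eos>': 1, 'a': 2, 'have': 3, 'i': 4, 'dog': 5, 'pen': 6}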
train_data = _file_to_word_ids(train_path, word_to_id)
valid_data = _file_to_word_ids(valid_path, word_to_id)
test_data = _file_to_word_ids(test_path, word_to_id)
def _file_to_word_ids(filename, word_to_id):
data = _read_words(filename)
return [word_to_id[word] for word in data if word in word_to_id] # replaces the newlines in the whole file with <eos>, then converts the list of words into the corresponding list of ids
vocabulary = len(word_to_id) # the number of words in the vocabulary
return train_data, valid_data, test_data, vocabulary # returns the id list for each file plus the vocabulary size
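Continuing my toy example from above, this is roughly what _file_to_word_ids produces (using the word_to_id dict built in the previous sketch):
sentence = ['i', 'have', 'a', 'pen', '.', '<eos>']
ids = [word_to_id[word] for word in sentence if word in word_to_id]
# [4, 3, 2, 6, 0, 1]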