A first write-up of the Alibaba cross-lingual short-text matching competition: an attempt with a deep-learning CNN

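The task in this competition was to decide whether two short Spanish sentences are semantically similar, so the goal was clear. My result was not great: I finished 44th out of 1027, still some distance from the top 20. Even so, I tried many methods from papers and learned a lot, so I want to record and summarize my approach in order to improve it and to share it with everyone. Corrections and discussion are welcome (QQ: 614489362@qq.com). The code will be put on GitHub.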

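Data description. The official description is as follows:

About the data
In this competition, the training data set contains two languages. We provide 20,000 labeled English question pairs as source data, as well as 1,400 labeled Spanish question pairs and 55,669 unlabeled Spanish questions. All labels were annotated manually by language and domain experts. We also provide the translation for each language.

Data fields
● cikm_english_train: English question pairs, match labels, and their Spanish translations.
Format:
English question 1, Spanish translation 1, English question 2, Spanish translation 2, match label.

A label of 1 means the two questions have the same meaning; 0 means they do not.

● cikm_spanish_train: Spanish question pairs, match labels, and their English translations.
Format:
Spanish question 1, English translation 1, Spanish question 2, English translation 2, match label.

A label of 1 means the two questions have the same meaning; 0 means they do not.

● cikm_unlabel_spanish_train: unlabeled Spanish corpus with its English translation.

● cikm_test_a: the test set, containing the Spanish question pairs to be predicted.

Fields are separated by the "\t" character.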
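The two labeled files differ only in column order, so a small parser can make that explicit. The following is a minimal sketch, assuming each line has exactly five tab-separated fields as described above; the names QuestionPair, parse_english_train_line and parse_spanish_train_line are illustrative and do not come from the competition code, and the sample sentences are made up.

```python
from typing import NamedTuple


class QuestionPair(NamedTuple):
    """One labeled Spanish question pair (hypothetical helper type)."""
    spanish_1: str
    spanish_2: str
    label: int


def parse_english_train_line(line: str) -> QuestionPair:
    # cikm_english_train order: English 1, Spanish 1, English 2, Spanish 2, label
    _en1, es1, _en2, es2, label = line.rstrip("\n").split("\t")
    return QuestionPair(es1, es2, int(label))


def parse_spanish_train_line(line: str) -> QuestionPair:
    # cikm_spanish_train order: Spanish 1, English 1, Spanish 2, English 2, label
    es1, _en1, es2, _en2, label = line.rstrip("\n").split("\t")
    return QuestionPair(es1, es2, int(label))


# Made-up example line, not taken from the real data set
pair = parse_english_train_line("hello\thola\thi friend\thola amigo\t1\n")
print(pair)
```

Both parsers return the Spanish sentences in the same positions, so downstream code does not need to know which file a pair came from.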

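Data processing is very important, but because of my limited experience I left much of the provided data unused. For example, I never worked out what the 55,669 unlabeled questions could be used for and did not pay them enough attention. The labeled data is small and imbalanced. My attempts were as follows: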

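First, I used the words shared by the two sentences directly as their similarity. This simple approach ranked 29th on the day it was submitted and ranks 53rd at the time of writing, which suggests that shared-word features carry much of the semantic signal. It cannot account for negation words, however, so a supervised method is needed to adjust the word weights. The loss was 0.85.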
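Second, I used gensim's TF-IDF as word weights and combined them with the result of the first method. The result was worse, which may mean that the way I combined them was wrong. A derived approach would be supervised learning on sentence-level features; the features I have found so far are shared words, TF-IDF, and edit distance. The loss was 0.95.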
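Two of the hand-crafted features mentioned above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a Jaccard-style ratio as one possible reading of "shared words"; the exact formula used in the competition is not given in the text, and the function names and example sentences are made up.

```python
def shared_word_ratio(sent_a: str, sent_b: str) -> float:
    """Jaccard overlap of the two word sets (one possible shared-word score)."""
    words_a = set(sent_a.lower().split())
    words_b = set(sent_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the usual two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# A negation changes the meaning, yet the overlap stays high (made-up sentences)
print(shared_word_ratio("no quiero pagar", "quiero pagar"))
print(edit_distance("kitten", "sitting"))
```

The first call shows the weakness noted above: a single negation word flips the meaning of the sentence while two of the three distinct words are still shared.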
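I originally thought an LSTM would be the best representation, since it can capture both the semantics and the structure of a sentence, so I planned to leave it for last and went straight to a CNN for semantic similarity. I drew my own diagram of the model first; it is fairly simple and is shown below: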

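The model convolves the two sentences independently. First, each sentence of a pair is fed in using the word vectors provided by the competition, so that each sentence becomes a matrix $A_{m,n}$, where m is the sentence length and n is the embedding dimension. The convolution windows span 1, 2, 3, and 5 words, although in practice 1, 2, and 3 already cover most phrase sizes. After the convolution layer and the max_pooling layer we obtain the outputs of kernels of several sizes, as shown in the figure. The maximum is taken for each feature map along the sentence-length axis, max_pooling(x1,x2,...,xm), where m is the number of positions along the sentence, so that axis is reduced to a single value. The outputs for the different filter sizes can be understood as representations of phrases of different lengths, and the strongest phrase feature in each dimension is used to represent the whole sentence. Unlike the figure above, I did not use full_connected_layers, because adding that layer made the results much worse on this data. The similarity of the two independently encoded sentences is computed directly with the cosine. Below I go through the complete code, starting from data processing, and explain it piece by piece. The code is at https://github.com/chenmingwei00/simple_cnn_cossin1.git.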
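The convolution plus max-over-time pooling described above can be illustrated without TensorFlow. This is a minimal pure-Python sketch, assuming tiny hand-picked kernels and 2-dimensional embeddings for readability; the real model uses 300-dimensional vectors and 400 learned filters per window size. The names conv_max_pool and sentence_vector are illustrative and not from the repository.

```python
def conv_max_pool(sentence, kernel, bias=0.0):
    """Slide a k x n kernel over an m x n sentence matrix, apply ReLU,
    and keep the maximum over all positions (max-over-time pooling)."""
    m, k = len(sentence), len(kernel)
    best = 0.0  # ReLU outputs are never negative
    for start in range(m - k + 1):  # 'VALID' convolution: m - k + 1 positions
        total = bias
        for offset in range(k):
            row, weights = sentence[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(row, weights))
        best = max(best, max(total, 0.0))
    return best


def sentence_vector(sentence, kernels):
    """One pooled value per kernel; concatenated they represent the sentence."""
    return [conv_max_pool(sentence, kernel) for kernel in kernels]


# Toy sentence of 4 words with 2-dimensional embeddings (made-up numbers)
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernels = [
    [[1, -1]],         # window of 1 word
    [[1, 0], [0, 1]],  # window of 2 words
]
print(sentence_vector(sentence, kernels))  # → [1.0, 2.0]
```

Because the maximum is taken over positions, the length of the resulting vector depends only on the number of kernels, not on the sentence length.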

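1. The first thing to introduce is the main script, the .py file that ties the model and the data together and runs them. In the repository this is cnn_version1.py; I explain it from the top.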

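There is not much to say about the initializer. It mainly sets dropout_keep_prob and batch_size, which is 10 here. The reason is simple: although about 20,000 training examples are provided, the ratio of label 1 to label 0 is 1:3, so the data is imbalanced. I read about many methods for imbalanced data, mainly SMOTE and undersampling, and will discuss the imbalance later. In short, undersampling worked best with my code. It leaves 10,000 examples, and a batch_size of 10 gave the best results.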
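Random undersampling of the majority class can be sketched as follows. This is a minimal illustration, assuming binary 0/1 labels and a fixed random seed; it is not the sampling code from the repository, and the function name undersample is made up.

```python
import random


def undersample(pairs, labels, seed=0):
    """Keep every minority-class example and an equally sized random
    subset of the majority class, then shuffle the result."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [pairs[i] for i in kept], [labels[i] for i in kept]


# Toy data with a 1:3 ratio of positives to negatives
toy_labels = [1] * 5 + [0] * 15
toy_pairs = list(range(20))
balanced_pairs, balanced_labels = undersample(toy_pairs, toy_labels)
print(len(balanced_labels), sum(balanced_labels))  # → 10 5
```

The price of this simplicity is that part of the majority class is thrown away, which matters when the labeled data is already small.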

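num_epochs is the number of iterations; each iteration trains on one batch of batch_size examples. Because the model is simple and the data is small, training usually converges after roughly 3,000 to 8,000 iterations. evaluate_every is not used this time. self.embedding_dim is the dimension of the word vectors; the provided vectors are 300-dimensional. self.num_filters=400 is the depth of the feature maps produced by the convolution. self.filter_sizes gives the number of words each kernel spans along the sentence, that is, the window size of the convolution kernel. self.get_data() simply prepares the data; the word vectors stay fixed and are not updated during training.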

f = open('E:/tianchi_tran/my_cnn/competation_data/spanish_vec.pkl', 'rb')

self.words_vec = pickle.load(f)

These two lines load the dictionary I built that maps each Spanish word to its word vector.

import datetime
import pickle
import time

import numpy as np
import tensorflow as tf

import insurance_qa_data_helpers
from cnn_model_version1 import InsQACNN  # assumption: InsQACNN is defined in the model file discussed later

class cnn_version1():

    def __init__(self):
        self.dropout_keep_prob=1.0
        self.batch_size=10
        self.num_epochs=500000
        self.evaluate_every=3000
        self.checkpoint_every=3000
        self.embedding_dim=300
        self.num_filters=400
        self.filter_sizes="1,2,3,5"
        self.l2_reg_lambda=0.05
        self.allow_soft_placement=True
        self.log_device_placement=False
        self.get_data()
        # load the word -> vector dictionary that was built beforehand
        with open('E:/tianchi_tran/my_cnn/competation_data/spanish_vec.pkl', 'rb') as f:
            self.words_vec = pickle.load(f)
        self.model=InsQACNN(
            sequence_length=70,  # padded length of every sentence, chosen from the longest sentence; whether this is appropriate is debatable
            batch_size=self.batch_size,
            vocab_size=len(self.vocab),
            embedding_size=self.embedding_dim,
            filter_sizes=list(map(int, self.filter_sizes.split(","))),
            num_filters=self.num_filters,
            l2_reg_lambda=self.l2_reg_lambda
        )
    # Prepare the training data; no validation data is used to guide training here
    def get_data(self):
        # what comes back is the tokenized data (words, not vectors yet)
        self.vocab,self.train_y,self.train_x_1,self.train_x_2 = insurance_qa_data_helpers.build_vocab()
        self.test_x1,self.test_x2=insurance_qa_data_helpers.build_vocab_test()

Below is the training function. The model as a whole is fairly simple, so I only explain the main statements.

def train(self):
    # configure and create the session
        session_conf = tf.ConfigProto(
            allow_soft_placement=self.allow_soft_placement,
            log_device_placement=self.log_device_placement)
        sess = tf.Session(config=session_conf)

        sess.run(tf.global_variables_initializer())  # initialize all variables
        with sess.as_default():
            saver,checkpoint_prefix,train_summary_writer=self.summary_fun(sess=sess)

            for i in range(self.num_epochs):  # start iterating
                try:
                    start = time.time()
                    # Randomly draw batch_size=10 training pairs and turn their words into numeric
                    # vectors via self.words_vec; self.x_train_1 has shape [10,70,300].
                    # self.words_vec is returned as well: a word missing from it gets a randomly
                    # initialized vector, which is reused the next time that word appears.
                    self.x_train_1, self.x_train_2, self.y_train_3 ,self.words_vec= insurance_qa_data_helpers.load_data_temp(self.words_vec,self.train_x_1,self.train_x_2,self.train_y, self.batch_size)
                    # reshape so that the dimensions match the placeholders in the model file
                    self.x_train_1=np.array([self.x_train_1])
                    self.x_train_1=self.x_train_1.reshape(self.x_train_1.shape[1],self.x_train_1.shape[2],self.x_train_1.shape[3],1)
                    self.x_train_2 = np.array([self.x_train_2])
                    self.x_train_2 = self.x_train_2.reshape(self.x_train_2.shape[1], self.x_train_2.shape[2],
                                                           self.x_train_2.shape[3], 1)
                    self.y_train_3=self.y_train_3.astype(float)
                    # print(self.x_train_1.shape)
                    # print(self.train_y.shape)
                    # By now the data is a word-vector matrix. Padding every sentence to a fixed
                    # length of 70 is not ideal: because max pooling is used, padding to the longest
                    # sentence in each batch would be enough. I only realized this after the
                    # competition, from a paper on convolutional networks for sentence classification.
                    self.train_step(self.x_train_1, self.x_train_2, self.y_train_3, sess=sess,train_summary_writer=train_summary_writer)
                    end = time.time()
                    current_step = tf.train.global_step(sess, self.model.global_step)
                    # save the model every self.checkpoint_every steps
                    if current_step % self.checkpoint_every == 0:
                        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                        print("Saved model checkpoint to {}\n".format(path))
                except Exception as e:
                    print(e)
                    break  # stop on error; at loop level this break would end training after one step

Next, the most important function to analyze is

self.train_step(self.x_train_1,self.x_train_2,self.y_train_3,sess=sess,train_summary_writer=train_summary_writer)#

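As shown below, this function is also fairly simple. It mainly builds the feed dictionary that maps the prepared data onto the placeholders of the model graph.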

def train_step(self,x_batch_1, x_batch_2, y_batch_3,sess,train_summary_writer):#)
    """
    A single training step
    """
    feed_dict = {
        self.model.input_x_1: x_batch_1, # the actual batch data is fed here
        self.model.input_x_2: x_batch_2,
        self.model.input_y_3: y_batch_3,
        self.model.dropout_keep_prob: self.dropout_keep_prob
    }
    output1,output2=sess.run([self.model.pooled_flat_1,self.model.pooled_flat_2],feed_dict)
    print(np.array(output1).shape)
    print(np.array(output1))
    _, step, summaries, loss, accuracy = sess.run(
        [self.model.train_op, self.model.global_step, self.model.train_summary_op, self.model.loss, self.model.accuracy],
        feed_dict)
    time_str = datetime.datetime.now().isoformat()
    if step%100==0:
        print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    train_summary_writer.add_summary(summaries, step)

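There is not much more to say about the function above; it is easy to follow. Everything so far belongs to the .py file that actually runs the training.

Next comes the data-processing code, which I wrote based on someone else's code.

I explain it starting from the data that the code above needs.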

insurance_qa_data_helpers.py

f = open('/home/admin/chen/chen/tianchi_trans/translate_data/spanish_vec.pkl', 'rb')
self.words_vec = pickle.load(f)

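The data needed is spanish_vec.pkl, a pickle that maps Spanish words to their word vectors. The original word-vector file provided is wiki.es.vec. I did not know how to use it directly, so I wrote a function that reads the text file and saves a word-to-vector dictionary as a pickle, which makes the vectors easy to load. The odd problem I ran into while reading the file is noted in the code comments. The function is fairly simple: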

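import pickle
import numpy as np

def load_word_vec():
    vec_path="E:/tianchi_tran/my_cnn/competation_data/wiki.es.vec"  # original file: each line holds a word followed by its 300-dimensional vector

    words_vec={}  # dictionary mapping each word to its vector
    # iterating with `for line in open(vec_path,'r',encoding='utf-8')` caused a problem for a reason I could not identify
    with open(vec_path,'r',encoding='utf-8') as file:
        lines=file.readlines()
        k=0
        for line in lines:
            if k==0:  # the first line records the total number of word vectors
                print(line)
                k+=1
                continue
            wordvec=line.split(" ")  # the word and the values are separated by " "
            # check the split: print any line that does not have 301 fields (word included);
            # nothing was printed, so splitting this way is correct
            if len(wordvec[:-1])!=301:
                print(len(wordvec))
                print(line)
            word=wordvec[0]  # the first field is the word
            vec=[float(ele) for ele in wordvec[1:-1]]  # the remaining fields are the vector
            words_vec[word]=vec
            if len(vec)!=300:  # double-check that the vector has 300 dimensions
                print(line)
                print(len(vec))
            k+=1
    vec = np.random.rand(300) - 0.5
    # Every sentence is padded to length 70 with the token 'ttttt'. This is not ideal:
    # with max_pooling, padding to the longest sentence in the batch would be enough.
    words_vec['ttttt']=vec
    # finally save the dictionary to spanish_vec.pkl
    with open('E:/tianchi_tran/my_cnn/competation_data/spanish_vec.pkl', 'wb') as f:
        pickle.dump(words_vec, f)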
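The same file can also be parsed line by line instead of loading everything with readlines(). This is a minimal sketch, assuming the file follows the layout described above (a header line, then one word followed by its values per line, separated by single spaces); read_vec and its dim parameter are illustrative names, and the sample data is made up.

```python
import io


def read_vec(lines, dim=300):
    """Parse text in the layout described above from any iterable of lines."""
    vectors = {}
    iterator = iter(lines)
    next(iterator, None)  # skip the header line holding the counts
    for line in iterator:
        parts = line.rstrip().split(" ")  # rstrip drops the trailing space and newline
        if len(parts) != dim + 1:
            continue  # assumption: silently skip rows that do not have word + dim values
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up 3-dimensional sample in the same layout
sample = io.StringIO("2 3\nhola 0.1 0.2 0.3 \nadios 1 2 3 \n")
print(read_vec(sample, dim=3))
```

An open file object is itself an iterable of lines, so it can be passed to this function directly. Files in this word2vec-style text layout can usually also be read with gensim's KeyedVectors.load_word2vec_format, which avoids hand-written parsing altogether.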

Now back to the get_data() function in cnn_version1.py:

self.vocab,self.train_y,self.train_x_1,self.train_x_2 = insurance_qa_data_helpers.build_vocab()
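This uses build_vocab(), which is also fairly simple: it splits the training sentences into words so that they can later be converted into word vectors. The function is somewhat redundant; because of the competition I did not tidy it up. There are two files, and in the first one English is the primary language and Spanish is the translation. My final model was not accurate, possibly because I trained mostly on translated Spanish. Since the competition is over, I did not investigate further. The function is as follows: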

def build_vocab():
    """
    Split the sentences into lists of words and collect the labels.
    At this stage the data still consists of words, not word vectors.
    train_path1 and train_path2 are assumed to be defined at module level.
    """
    code = int(0)
    vocab = {}  # maps each word in the training data to its index
    vocab['UNKNOWN'] = code  # index of the word
    code += 1
    train_x1=[]  # word lists of the first sentence of each pair
    train_x2=[]  # word lists of the second sentence of each pair
    train_y=[]  # labels of the sentence pairs
    max_len=60  # tracks the longest sentence; all sentences are later padded to a fixed length, which is not ideal
    for line in open(train_path1,encoding='utf-8'):  # first file: English is primary, Spanish is the translation
        english_1,span_1,english_2,span_2,lable = line.strip().split("\t")
        words1= span_1.split()  # words of the first Spanish sentence
        words2=span_2.split()
        if len(words1)>max_len:
            max_len=len(words1)
        if len(words2)>max_len:
            max_len=len(words2)
        train_x1.append(words1)  # add the word list of each training sentence
        train_x2.append(words2)
        for word in words1:
            if not word in vocab:
                vocab[word] = code
                code += 1
        for word in words2:
            if not word in vocab:
                vocab[word] = code
                code += 1
        train_y.append(int(lable))  # one label per pair
    for line in open(train_path2,encoding='utf-8'):  # second file: Spanish is primary, but there is little of this data
        span_1, english_1,  span_2, english_2,lable = line.strip().split("\t")
        words1 = span_1.split()  # words of the first Spanish sentence
        words2 = span_2.split()
        if len(words1) > max_len:
            max_len = len(words1)
        if len(words2) > max_len:
            max_len = len(words2)
        train_x1.append(words1)
        train_x2.append(words2)
        for word in words1:
            if not word in vocab:
                vocab[word] = code
                code += 1
        for word in words2:
            if not word in vocab:
                vocab[word] = code
                code += 1
        train_y.append(int(lable))
    print(max_len)
    return vocab,train_y,train_x1,train_x2
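The two loops in build_vocab() differ only in which columns hold the Spanish text, so the redundancy mentioned above could be removed with a helper that takes the column positions as a parameter. This is a minimal sketch under that assumption; read_pairs and add_to_vocab are hypothetical names, the sample lines are made up, and the vocabulary indices may come out in a different order than in the original function.

```python
def read_pairs(lines, spanish_columns):
    """Collect tokenized Spanish sentence pairs and labels from tab-separated lines."""
    first, second = spanish_columns
    x1, x2, y = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        x1.append(fields[first].split())
        x2.append(fields[second].split())
        y.append(int(fields[4]))  # the label is always the fifth field
    return x1, x2, y


def add_to_vocab(vocab, sentences):
    """Give every unseen word the next free index."""
    for words in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab))


# Made-up lines in the two layouts described in the data section
english_file = ["how are you\tcomo estas\thow do you do\tque tal estas\t1\n"]
spanish_file = ["hola\thello\tadios\tbye\t0\n"]

vocab = {"UNKNOWN": 0}
train_x1, train_x2, train_y = [], [], []
for lines, columns in ((english_file, (1, 3)), (spanish_file, (0, 2))):
    x1, x2, y = read_pairs(lines, columns)
    train_x1 += x1
    train_x2 += x2
    train_y += y
    add_to_vocab(vocab, x1 + x2)
print(train_y, len(vocab))
```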

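The other call in self.get_data(),

self.test_x1,self.test_x2=insurance_qa_data_helpers.build_vocab_test()

produces the word lists for the test set in the same way.

Next, the word lists are converted into actual word-vector matrices. Every sentence matrix has 70 rows, because I padded all sentences to 70*300. The call that does this is:

self.x_train_1, self.x_train_2, self.y_train_3 ,self.words_vec= insurance_qa_data_helpers.load_data_temp(self.words_vec,self.train_x_1,self.train_x_2,self.train_y, self.batch_size)

The implementation is as follows: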

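import random
import re
import numpy as np

def load_data_temp(words_vec,train_x1,train_x2,train_y, size):
    """
    :param words_vec: dictionary mapping words to vectors
    :param train_x1: word lists of the first sentences
    :param train_x2: word lists of the second sentences
    :param train_y: labels, one per sentence pair
    :param size: batch size (10 here)
    :return: x_train_1 word-vector matrices of the first sentences
             x_train_2 word-vector matrices of the second sentences
             plus the labels and the (possibly extended) words_vec
    """
    x_train_1 = []  # word-vector matrices of the sampled sentences
    x_train_2 = []
    y_train3=[]
    for i in range(0, size):  # randomly draw batch_size examples
        train_oen_x1=[]  # vectors of the current sentence; always 70 rows in the end
        train_oen_x2=[]
        train_index=random.randint(0, len(train_y) - 1)  # pick a random index
        temp_train_y = train_y[train_index]  # the randomly selected training example
        temp_train_x1=train_x1[train_index]
        temp_train_x2=train_x2[train_index]
        for one in temp_train_x1:  # convert every word of the sentence into its vector
            # Lower-casing everything is arguably wrong: embeddings can be trained, and in
            # convolutional classification I found that keeping the original case for
            # English was more accurate than lower()
            ones = ''.join(re.findall(r'\w',one)).lower()
            try:
                train_oen_x1.append(np.array(words_vec[ones]))  # add the word's vector
            except KeyError:
                # Unknown word: initialize a random vector and remember it. A major reason the
                # accuracy would not improve lies in the data processing: random vectors that
                # are never trained give poor results, which deserves attention in future.
                vec=np.random.rand(300) - 0.5
                words_vec[ones]=vec
                train_oen_x1.append(vec)
        for one in temp_train_x2:  # same conversion for the second sentence
            ones=''.join(re.findall(r'\w', one)).lower()
            try:
                train_oen_x2.append(np.array(words_vec[ones]))
            except KeyError:
                vec = np.random.rand(300) - 0.5
                words_vec[ones] = vec
                train_oen_x2.append(vec)
        # Sentences shorter than 70 are padded with the vector of 'ttttt'. This is debatable,
        # since padding to the longest sentence in the batch would be enough.
        rest_len_1=70-len(train_oen_x1)
        rest_len_2 = 70 - len(train_oen_x2)
        for k in range(rest_len_1):
            train_oen_x1.append(np.array(words_vec['ttttt']))
        for k in range(rest_len_2):
            train_oen_x2.append(np.array(words_vec['ttttt']))
        x_train_1.append(train_oen_x1)
        x_train_2.append(train_oen_x2)
        y_train3.append(temp_train_y)
    return np.array(x_train_1),np.array(x_train_2),np.array(y_train3),words_vec  # also return the word vectors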
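The comments above mention several times that padding to the longest sentence in each batch would be enough. A minimal sketch of that idea follows, assuming each sentence is already a list of word vectors; pad_batch is an illustrative name and is not part of the repository.

```python
def pad_batch(batch, pad_vector):
    """Pad every sentence (a list of word vectors) to the longest one in the batch."""
    longest = max(len(sentence) for sentence in batch)
    return [sentence + [pad_vector] * (longest - len(sentence)) for sentence in batch]


# Toy batch with 1-dimensional "embeddings": one sentence of 1 word, one of 3 words
batch = [[[1.0]], [[2.0], [3.0], [4.0]]]
padded = pad_batch(batch, pad_vector=[0.0])
print([len(sentence) for sentence in padded])  # → [3, 3]
```

This would not work with the model as it stands, because its placeholders and pooling window are fixed at sequence_length. The model would need a variable-length sentence axis as well, for example by replacing the fixed-size pooling with tf.reduce_max over that axis.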

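That essentially completes the data processing. Next comes the basic structure of the model, which is fairly simple. I start from

the model file cnn_model_version1.py: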

# Placeholders whose shapes match the actual data described above
self.input_x_1 = tf.placeholder(tf.float32, [batch_size, sequence_length,embedding_size,1], name="input_x_1")
# first sentence of each pair (above), second sentence of each pair (below)
self.input_x_2 = tf.placeholder(tf.float32, [batch_size, sequence_length,embedding_size,1], name="input_x_2")
# 0/1 match label of each pair
self.input_y_3 = tf.placeholder(tf.float32, [batch_size,], name="input_y_3")
self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")  # dropout keep probability
self.global_step = tf.Variable(-1, trainable=False)  # global step counter; very important

pooled_outputs_1 = []  # sentence representations from the different kernel sizes
pooled_outputs_2 = []
# The loop below convolves both sentences and takes the maximum of every feature map
# as the final sentence representation.
for i, filter_size in enumerate(filter_sizes):  # filter_sizes=[1,2,3,5], the different kernel sizes
    with tf.name_scope("conv-maxpool-%s" % filter_size):  # one name scope per kernel size
        # num_filters=400; filter_size is the number of words the kernel spans
        filter_shape = [filter_size, embedding_size,1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.input_x_1,
            W,
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="conv-1"
        )
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu-1")
        # Max pooling over the sentence-length axis: for a feature map of roughly [70, 400]
        # the result has 400 dimensions, i.e. the strongest features of the sentence.
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="poll-1"
        )
        pooled_outputs_1.append(pooled)  # one entry per kernel size, four in total
        # The same operations for the second sentence; W and b are reused,
        # so both sentences are encoded with shared weights.
        conv = tf.nn.conv2d(
            self.input_x_2,
            W,
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="conv-2"
        )
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu-2")
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="poll-2"
        )
        pooled_outputs_2.append(pooled)
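The tensor shapes in the loop above follow from simple arithmetic, which the following sketch spells out. It is a plain-Python illustration rather than part of the model, and it assumes 'VALID' padding with stride 1 as in the code; conv_pool_shapes is a hypothetical helper name.

```python
def conv_pool_shapes(sequence_length, filter_sizes, num_filters):
    """Per-example shapes after convolution and after max pooling,
    plus the length of the concatenated sentence vector."""
    shapes = {}
    for k in filter_sizes:
        conv_len = sequence_length - k + 1  # positions a window of k words can take
        shapes[k] = {
            "conv": (conv_len, 1, num_filters),  # the kernel spans the full embedding width
            "pooled": (1, 1, num_filters),       # the pooling window covers all conv_len positions
        }
    return shapes, num_filters * len(filter_sizes)


shapes, total = conv_pool_shapes(70, [1, 2, 3, 5], 400)
for k, shape in shapes.items():
    print(k, shape)
print(total)  # → 1600
```

With the settings used here, each sentence ends up as a 1600-dimensional vector, which is the value of num_filters_total in the next excerpt.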

The above yields the sentence representations. The next step is simply to compute the similarity from the two representation vectors:

num_filters_total = num_filters * len(filter_sizes)
# Flatten to one row per sentence, shape [batch_size, 1600], so the similarity can be computed per pair
pooled_reshape_1 = tf.reshape(tf.concat(pooled_outputs_1, axis=3),[-1, num_filters_total])
pooled_reshape_2 = tf.reshape(tf.concat(pooled_outputs_2, axis=3),[-1, num_filters_total])
pooled_flat_1 = tf.nn.dropout(pooled_reshape_1, self.dropout_keep_prob)  # final sentence vector; sigmoid(q*W*a) could be applied here instead
self.pooled_flat_1=pooled_flat_1
pooled_flat_2 = tf.nn.dropout(pooled_reshape_2, self.dropout_keep_prob)  # dropout; it did not help here
self.pooled_flat_2=pooled_flat_2  # train_step fetches this tensor as well
# pooled_flat_2=tf.transpose(pooled_flat_2,)  # only needed for the bilinear form below; it would break the element-wise products used for the cosine
similar_weight=tf.Variable(tf.truncated_normal([num_filters_total,num_filters_total], stddev=0.1), name="W")  # unused by the cosine version
print(similar_weight)
# c = tf.matmul(self.pooled_flat_1, similar_weight)
# self.final_result=tf.matmul(c,pooled_flat_2)
b_final = tf.Variable(tf.constant(0.1), name="b")  # unused by the cosine version
# Use the cosine directly as the similarity; with the data as I organized it, softmax performed poorly
pooled_len_1 = tf.sqrt(tf.reduce_sum(tf.multiply(pooled_flat_1, pooled_flat_1), 1))  # vector norms for the cosine similarity
pooled_len_2 = tf.sqrt(tf.reduce_sum(tf.multiply(pooled_flat_2, pooled_flat_2), 1))
pooled_mul_12 = tf.reduce_sum(tf.multiply(pooled_flat_1, pooled_flat_2), 1)  # batched dot product of the two vectors
# pooled_flat_3 is not defined in this model, so the two lines that used it are disabled:
# pooled_len_3 = tf.sqrt(tf.reduce_sum(tf.multiply(pooled_flat_3, pooled_flat_3), 1))
# pooled_mul_13 = tf.reduce_sum(tf.multiply(pooled_flat_1, pooled_flat_3), 1)

with tf.name_scope("output"):
    self.cos_12 = tf.div(pooled_mul_12, tf.multiply(pooled_len_1, pooled_len_2), name="scores")  # final similarity
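For reference, the cosine computed by the graph above is the dot product of the two sentence vectors divided by the product of their norms. The following is a minimal pure-Python sketch of the same formula for a single pair; returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero, and is not something the TensorFlow code does.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Because the sentence vectors come out of ReLU and max pooling, none of their components is negative, so the cosine of two such vectors falls between 0 and 1. That makes it convenient to compare directly with the 0/1 match label.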

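That covers the whole simple model. All of my competition models were built on it. I found that the LSTM performed rather poorly and never understood why. It may be caused by how I constructed the data, because the model turned out to be very sensitive to the data, so in future I need to pay much more attention to data construction.

For lack of time, I will next describe only the best model I managed to build. See the next chapter.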

https://mp.youkuaiyun.com/postedit/81988789