Step 4: we can now build and train the Skip-Gram model.
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.
num_sampled = 64  # Number of negative examples to sample.
# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. These 3 variables are used only for
# displaying model accuracy, they don't affect calculation.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
This np.random.choice call picks 16 distinct integers from 0 to 99; these integers correspond to the integer indices of the 100 most frequent words in the text data.
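As a quick sanity check, here is a minimal NumPy-only sketch (nothing graph-related is needed) of what valid_examples looks like; the printed values are illustrative only:

import numpy as np

valid_size = 16
valid_window = 100
# replace=False guarantees 16 distinct IDs, all below 100.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
print(valid_examples)            # e.g. [83  4 57 ...]: 16 unique ints in [0, 100)
print(len(set(valid_examples)))  # always 16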
graph = tf.Graph()

with graph.as_default():

  # Input data.
  with tf.name_scope('inputs'):
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    with tf.name_scope('embeddings'):
      # Randomly initialize the embedding vectors of all words in [-1, 1).
      embeddings = tf.Variable(
          tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
      # Fetch the embedding vectors of this batch's input words.
      embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss.
    with tf.name_scope('weights'):
      nce_weights = tf.Variable(
          tf.truncated_normal([vocabulary_size, embedding_size],
                              stddev=1.0 / math.sqrt(embedding_size)))
    with tf.name_scope('biases'):
      nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Compute the average NCE loss for the batch.
  # tf.nn.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  # Explanation of the meaning of NCE loss:
  # http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  with tf.name_scope('loss'):
    loss = tf.reduce_mean(
        tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=train_labels,
            inputs=embed,
            num_sampled=num_sampled,
            num_classes=vocabulary_size))

  # Add the loss value as a scalar to summary.
  tf.summary.scalar('loss', loss)

  # Construct the SGD optimizer using a learning rate of 1.0.
  with tf.name_scope('optimizer'):
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
  # Compute the cosine similarity between minibatch examples and all embeddings.
  # Normalize every word vector to unit length (divide by its L2 norm).
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,
                                            valid_dataset)
  # Cosine similarity between the validation words and the whole vocabulary.
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Merge all summaries.
  merged = tf.summary.merge_all()
  # Add variable initializer.
  init = tf.global_variables_initializer()
  # Create a saver.
  saver = tf.train.Saver()
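To make the similarity op concrete, here is a minimal sketch of reading out each validation word's nearest neighbors. It assumes an open tf.Session in which the model has already trained for a while, and a reverse_dictionary (word ID to word string) built in the earlier data-preparation step:

# Hedged sketch: print the 8 nearest neighbors of each validation word.
# Assumes an active session and reverse_dictionary from the earlier steps.
sim = similarity.eval()  # shape [valid_size, vocabulary_size]
for i in range(valid_size):
  valid_word = reverse_dictionary[valid_examples[i]]
  top_k = 8
  # argsort is ascending, so negate; index 0 is the word itself, so skip it.
  nearest = (-sim[i, :]).argsort()[1:top_k + 1]
  print('Nearest to %s: %s' %
        (valid_word, ', '.join(reverse_dictionary[k] for k in nearest)))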
The tf.nn.nce_loss function is a one-stop service: it draws the negative samples and computes the loss for you, which is very convenient. For details, see https://blog.youkuaiyun.com/u012436149/article/details/52848013.
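Putting it together, a minimal training loop might look like the sketch below. The value of num_steps and the generate_batch(batch_size, num_skips, skip_window) helper are assumptions carried over from the earlier steps of this series:

num_steps = 100001  # assumed value; adjust to taste

with tf.Session(graph=graph) as session:
  init.run()  # initialize all variables before training
  average_loss = 0
  for step in range(num_steps):
    # generate_batch is assumed from the earlier data-pipeline step.
    batch_inputs, batch_labels = generate_batch(batch_size, num_skips,
                                                skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    # One SGD step; tf.nn.nce_loss resamples negatives on every run.
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val
    if step > 0 and step % 2000 == 0:
      print('Average loss at step %d: %.3f' % (step, average_loss / 2000))
      average_loss = 0
  # Final unit-length embeddings, ready for nearest-neighbor queries.
  final_embeddings = normalized_embeddings.eval()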