How EncoderStack uses its three inputs encoder_inputs, attention_bias, and inputs_padding:
encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
encoder_inputs = feed_forward_network(encoder_inputs, inputs_padding)
Finally, EncoderStack outputs
self.output_normalization(encoder_inputs)
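A minimal sketch of how these calls fit together inside EncoderStack (illustrative only; in the official code each sub-layer is additionally wrapped with layer normalization, dropout and a residual connection, which are omitted here):
def encoder_stack_call(encoder_inputs, attention_bias, inputs_padding,
                       layers, output_normalization):
  # layers: a list of (self_attention_layer, feed_forward_network) pairs,
  # one pair per encoder layer
  for self_attention_layer, feed_forward_network in layers:
    # attention_bias masks padded positions inside the attention softmax
    encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
    # inputs_padding marks which positions are padding for the feed-forward net
    encoder_inputs = feed_forward_network(encoder_inputs, inputs_padding)
  # final layer normalization over the stack output
  return output_normalization(encoder_inputs)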
SelfAttention is a special case of the Attention class whose first and second arguments are the same:
class SelfAttention(Attention):
  """Multiheaded self-attention layer."""

  def call(self, x, bias, cache=None):
    return super(SelfAttention, self).call(x, x, bias, cache)
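For example, a hypothetical call (the constructor arguments hidden_size, num_heads, attention_dropout and train follow the official repository and may differ between versions):
self_attention = SelfAttention(hidden_size=512, num_heads=8, attention_dropout=0.1, train=True)
y = self_attention(x, bias)  # x: [batch_size, length, 512]; y has the same shape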
The Attention class implements the attention of x over y.
It first generates q, k, v:
q = self.q_dense_layer(x)
k = self.k_dense_layer(y)
v = self.v_dense_layer(y)
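These projections are plain dense layers. A minimal sketch of how they might be created in the constructor (consistent with the note at the end of this section that they are tf.layers.Dense with hidden_size units; use_bias=False is an assumption taken from the official repository):
self.q_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="q")
self.k_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="k")
self.v_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="v")
self.output_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="output_transform")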
Multi-head (note: after split_heads below, q, k and v end up as 4-D tensors of shape [batch_size, num_heads, length, depth]; tf.matmul treats the leading dimensions as batch dimensions, so each head is multiplied independently).
split_heads takes an input x of shape [batch_size, length, hidden_size] and returns a tensor of shape [batch_size, num_heads, length, hidden_size/num_heads].
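A minimal sketch of what split_heads does (a reshape followed by a transpose; hidden_size and num_heads are passed in explicitly here for clarity):
import tensorflow as tf

def split_heads(x, hidden_size, num_heads):
  depth = hidden_size // num_heads
  batch_size = tf.shape(x)[0]
  length = tf.shape(x)[1]
  # [batch_size, length, hidden_size] -> [batch_size, length, num_heads, depth]
  x = tf.reshape(x, [batch_size, length, num_heads, depth])
  # -> [batch_size, num_heads, length, depth]
  return tf.transpose(x, [0, 2, 1, 3])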
# Split q, k, v into heads.
q = self.split_heads(q)
k = self.split_heads(k)
v = self.split_heads(v)
Scale: divide q by $\sqrt{d_k}$, where d_k = hidden_size / num_heads is the per-head depth.
# Scale q to prevent the dot product between q and k from growing too large.
depth = (self.hidden_size // self.num_heads)
q *= depth ** -0.5
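A quick numerical check of why this scaling helps (illustrative NumPy snippet, not part of the model): the dot product of two random vectors with unit-variance components has variance roughly d_k, and dividing by sqrt(d_k) brings it back to roughly 1, which keeps the softmax from saturating.
import numpy as np

d_k = 64
q = np.random.randn(10000, d_k)
k = np.random.randn(10000, d_k)
dots = (q * k).sum(axis=1)
print(np.var(dots))                  # ~ d_k (about 64)
print(np.var(dots / np.sqrt(d_k)))   # ~ 1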
Multiply q by the transpose of k; the result has shape [batch_size, num_heads, x_length, y_length]:
logits = tf.matmul(q, k, transpose_b=True)
Then add the bias (an additive mask: positions to be ignored, such as padding, carry large negative values so the softmax gives them near-zero weight):
logits += bias
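A hedged sketch of how such a padding bias could be built (the padding id 0 and the constant -1e9 follow the official data pipeline and are assumptions here):
import tensorflow as tf

_NEG_INF = -1e9

def get_padding_bias(input_ids):
  # 1.0 at padded positions (id == 0), 0.0 elsewhere
  padding = tf.cast(tf.equal(input_ids, 0), tf.float32)
  bias = padding * _NEG_INF
  # [batch_size, length] -> [batch_size, 1, 1, length] so it broadcasts
  # against logits of shape [batch_size, num_heads, x_length, y_length]
  return tf.expand_dims(tf.expand_dims(bias, axis=1), axis=1)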
Softmax, so the weights along the last dimension sum to 1:
weights = tf.nn.softmax(logits, name="attention_weights")
Combine the values v weighted by the attention weights:
attention_output = tf.matmul(weights, v)
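A tiny worked example of this weighted combination (illustrative numbers; one query position attending over three positions with scalar values):
weights = [0.7, 0.2, 0.1]   # attention weights, sum to 1
v = [1.0, 2.0, 3.0]         # values at the three attended positions
output = 0.7*1.0 + 0.2*2.0 + 0.1*3.0   # = 1.4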
Concatenate the results of the different heads, converting back to the shape before the split:
# Recombine heads --> [batch_size, length, hidden_size]
attention_output = self.combine_heads(attention_output)
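combine_heads is the inverse of split_heads: transpose the head dimension back next to the depth dimension, then merge them. A minimal sketch (hidden_size passed explicitly for clarity):
import tensorflow as tf

def combine_heads(x, hidden_size):
  batch_size = tf.shape(x)[0]
  length = tf.shape(x)[2]
  # [batch_size, num_heads, length, depth] -> [batch_size, length, num_heads, depth]
  x = tf.transpose(x, [0, 2, 1, 3])
  # -> [batch_size, length, hidden_size]
  return tf.reshape(x, [batch_size, length, hidden_size])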
Finally, the result goes through the output_dense_layer weight matrix of shape [hidden_size, hidden_size], giving an output of shape [batch_size, length, hidden_size]:
# Run the combined outputs through another linear projection layer.
attention_output = self.output_dense_layer(attention_output)
In addition, the final output_dense_layer and the layers that produce q, k and v are all tf.layers.Dense, i.e. fully connected layers, each with hidden_size units.
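Putting the shapes together for one call of Attention.call, writing x_length and y_length for the lengths of x and y:
x: [batch_size, x_length, hidden_size]
q after split_heads: [batch_size, num_heads, x_length, hidden_size/num_heads]
k, v after split_heads: [batch_size, num_heads, y_length, hidden_size/num_heads]
logits, weights: [batch_size, num_heads, x_length, y_length]
attention_output before combine_heads: [batch_size, num_heads, x_length, hidden_size/num_heads]
after combine_heads: [batch_size, x_length, hidden_size]
after output_dense_layer: [batch_size, x_length, hidden_size]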