Masking in TensorFlow

When training an LSTM on input sequences of different lengths, the sequences usually have to be padded to a fixed length. In TensorFlow this can be done by appending an artificial "NUL" symbol. The problem is that the model may then learn behaviour tied to the padding symbol, which hurts generalization. The fix is to apply a mask when computing the cost, so that the padded positions are ignored. There are many examples of this in sequence-to-sequence models, e.g. the "bucketing and padding" scheme described in TensorFlow's official seq2seq tutorial.
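
As a concrete illustration, here is a minimal padding sketch (the pad id 0 and the helper name pad_batch are assumptions for illustration, not from the original post): every sequence in a batch is cut or padded to config.num_steps, and the true lengths are recorded so that the mask function below can ignore the padded steps when averaging the loss.

    PAD_ID = 0  # assumed id reserved for the artificial "NUL" symbol

    def pad_batch(sequences, num_steps):
        """Pad (or truncate) each sequence to num_steps and record the true lengths."""
        padded, lengths = [], []
        for seq in sequences:
            seq = list(seq)[:num_steps]                    # truncate if too long
            lengths.append(len(seq))                       # true, unpadded length
            padded.append(seq + [PAD_ID] * (num_steps - len(seq)))
        return padded, lengths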


    # loss has shape [batch, num_steps]; seqlen has shape [batch]
    def mask(self, loss, seqlen):
        # 1.0 for real time steps, 0.0 for padded ones
        mask = tf.sequence_mask(seqlen, maxlen=config.num_steps, dtype=tf.float32)
        # sum the per-step losses over real steps, divide by the number of real steps
        clear_loss = tf.reduce_sum(loss * mask) / tf.reduce_sum(mask)
        return clear_loss
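
A sketch of how this mask function might be wired into a training step (the tensor names logits, targets and seqlen are assumptions, not part of the original code): compute a per-step loss that keeps the [batch, num_steps] shape, then mask and average it.

    # logits:  [batch, num_steps, vocab_size]   model outputs
    # targets: [batch, num_steps]               target ids, padded with PAD_ID
    # seqlen:  [batch]                          true lengths before padding
    loss_per_step = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=targets)          # shape [batch, num_steps]
    clear_loss = self.mask(loss_per_step, seqlen)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(clear_loss)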

The implementation above was inspired by the following Q&A:

Question:
Hi,

Say I want to train some LSTM unit, and my training data has variable lengths with a maximum length of say, 30.
What is the right thing to do?

In TF we cannot dynamically create a computation graph of varied lengths, so the number of LSTM unrolling is fixed.
So do we have to pad everything to have a length of 30?

Let’s say my input is a sequence of symbols from a certain alphabet, do I have to add a “NUL” symbol to my alphabet, so that my input now looks like:
w1, w2, … wn, NUL, NUL, NUL, NUL…

This is what I am doing now. However I think this is wrong as the LSTM now will learn some additional behaviours when consuming the (artificial) NUL symbol.
I’m worried that models trained this way won’t be able to generalize well when the length is not bound to 30.

Thanks!
–evan

Answer:
for transduction problems (1:1 between sequence input and target) the general approach i think is to allow the RNN to run over these NUL values but then you apply a mask to zero out the cost associated with them.

eg for sequence [w1, w2, w3, NUL, NUL, NUL]
you first calculate the per element costs, say, costs = [3.1, 4.1, 5.9, 2.6, 5.3, 5.8]

usually you’d take the mean; np.mean(costs) ≈ 4.47, but in this case you don’t care about the last three.

so now you’ll maintain a mask, 0 for NUL and 1 otherwise, mask = [1,1,1,0,0,0]
and you’ll calculate your sequence cost using this mask to zero out the costs you don’t care about;
sequence_cost = np.sum(costs * mask) / np.sum(mask)
(note! NOT np.mean(costs * mask) since the effective sequence “length” has changed from 6 to 3)

it’s “wasteful” in the sense that you’re doing more work than the unpadded version, but the argument is that the more densely packed data makes up for it through the speed-up in the lower-level libraries

there are lots of examples of this in the tensorflow seq2seq models
see http://www.tensorflow.org/tutorials/seq2seq/index.html “bucketing and padding” for the high level view of this (+ the extended idea of bucketing)
and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/seq2seq.py for more detail in code

Original thread: https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/wk8sbFGyfHA
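
For completeness, a quick numpy check of the numbers in the answer above:

    import numpy as np

    costs = np.array([3.1, 4.1, 5.9, 2.6, 5.3, 5.8])
    mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

    sequence_cost = np.sum(costs * mask) / np.sum(mask)
    print(sequence_cost)          # ~4.37, the mean over the 3 real steps only
    print(np.mean(costs * mask))  # ~2.18, wrong: divides by 6 instead of 3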
