- transformer.py
class Transformer(CaptioningModel):
- captioning_model.py
class CaptioningModel(Module):
    def beam_search(self, visual: utils.TensorOrSequence, max_len: int, eos_idx: int, beam_size: int, out_size=1,
                    return_probs=False, **kwargs):
        bs = BeamSearch(self, max_len, eos_idx, beam_size)
        return bs.apply(visual, out_size, return_probs, **kwargs)
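For orientation, a minimal usage sketch of this wrapper (the feature shape, max_len, eos_idx and beam_size values below are illustrative assumptions, not values from the repo):
import torch

def generate_captions(model, visual: torch.Tensor):
    # Hypothetical helper, assuming `visual` is a batch of precomputed region features,
    # e.g. (batch_size, num_regions, feat_dim). beam_search keeps `beam_size` hypotheses
    # per image and returns the best `out_size` sequence(s) per image.
    return model.beam_search(visual, max_len=20, eos_idx=3, beam_size=5,
                             out_size=1, return_probs=False)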
- beam_search.py
def apply(self, visual: utils.TensorOrSequence, out_size=1, return_probs=False, **kwargs):
    # (the earlier lines of apply, which initialize outputs and the running beam state, are omitted in these notes)
    with self.model.statefulness(self.b_s):
        for t in range(self.max_len):
            visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
- containers.py
@contextmanager
def statefulness(self, batch_size: int):
    self.enable_statefulness(batch_size)
    try:
        yield
    finally:
        self.disable_statefulness()
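The contextmanager guarantees that disable_statefulness runs even if decoding throws. A toy class (not from the repo) showing the same pattern:
from contextlib import contextmanager

class ToyStatefulModule:
    # Minimal stand-in for containers.Module, only to demonstrate the enable/disable pattern.
    def __init__(self):
        self.is_stateful = False

    def enable_statefulness(self, batch_size: int):
        # The real Module allocates per-sample running states here (e.g. cached decoder states).
        self.is_stateful = True
        self.batch_size = batch_size

    def disable_statefulness(self):
        self.is_stateful = False

    @contextmanager
    def statefulness(self, batch_size: int):
        self.enable_statefulness(batch_size)
        try:
            yield
        finally:
            # Runs on normal exit and on exceptions alike.
            self.disable_statefulness()

m = ToyStatefulModule()
with m.statefulness(batch_size=4):
    assert m.is_stateful
assert not m.is_stateful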
def iter(self, t: int, visual: utils.TensorOrSequence, outputs, return_probs, **kwargs):
    cur_beam_size = 1 if t == 0 else self.beam_size
    word_logprob = self.model.step(t, self.selected_words, visual, None, mode='feedback', **kwargs)
    # ((batch_size * beam_size), cur_len, vocab_size): word_logprob is the output of the decoder.
    word_logprob = word_logprob.view(self.b_s, cur_beam_size, -1)
    # (batch_size, cur_beam_size, vocab_size): the token log-probabilities of the last time step,
    # i.e. the conditional probability of each candidate next word.
    candidate_logprob = self.seq_logprob + word_logprob
    # (batch_size, cur_beam_size, vocab_size): cumulative log-probability of each candidate
    # sequence; because we work with logs, the probabilities can simply be added.
't' denotes the time step. When t is 0, each sentence always begins with the special token 'bos', so there is only one hypothesis per image and cur_beam_size = 1.
cur_beam_size = 1 if t == 0 else self.beam_size
is the same as:
if t == 0:
    cur_beam_size = 1
else:
    cur_beam_size = self.beam_size
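To make the shapes and the log-space addition concrete, here is a standalone check with toy sizes (the top-k selection at the end is a simplification; the actual selection logic lives further down in beam_search.py):
import torch

b_s, beam_size, vocab_size = 2, 3, 7
seq_logprob = torch.zeros(b_s, beam_size, 1)                 # cumulative log-prob of each kept beam
word_logprob = torch.log_softmax(torch.randn(b_s, beam_size, vocab_size), dim=-1)

# log P(y_1..y_t) = log P(y_1..y_{t-1}) + log P(y_t | y_1..y_{t-1});
# broadcasting (b_s, beam_size, 1) + (b_s, beam_size, vocab_size) adds the running score
# of each beam to every possible next word.
candidate_logprob = seq_logprob + word_logprob               # (b_s, beam_size, vocab_size)

# Keep the beam_size best (beam, word) pairs per image.
selected_logprob, selected_idx = candidate_logprob.view(b_s, -1).topk(beam_size, dim=-1)
print(candidate_logprob.shape, selected_idx.shape)           # (2, 3, 7) and (2, 3)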
def step(self, t, prev_output, visual, seq, mode='teacher_forcing', **kwargs):
    it = None
    if mode == 'teacher_forcing':
        raise NotImplementedError
    elif mode == 'feedback':
        if t == 0:
            self.enc_output, self.mask_enc = self.encoder(visual)
            if isinstance(visual, torch.Tensor):
                it = visual.data.new_full((visual.shape[0], 1), self.bos_idx).long()
            else:
                it = visual[0].data.new_full((visual[0].shape[0], 1), self.bos_idx).long()
        else:
            it = prev_output
    return self.decoder(it, self.enc_output, self.mask_enc)
If t == 0, every value in 'it' is self.bos_idx, which is 2 here. For any later time step, 'it' holds the words selected at the previous step by beam search.
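A quick standalone check of the t == 0 branch (the feature shape is an assumption; bos_idx = 2 as stated above): new_full builds a (batch_size, 1) tensor filled with the BOS index on the same device as visual, and .long() casts it to integer token indices.
import torch

visual = torch.randn(4, 50, 2048)       # assumed (batch_size, num_regions, feat_dim) features
bos_idx = 2
it = visual.data.new_full((visual.shape[0], 1), bos_idx).long()
print(it)                               # tensor([[2], [2], [2], [2]])
print(it.shape)                         # torch.Size([4, 1])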
C5W3L04 Refining Beam Search https://www.youtube.com/watch?v=gb__z7LlN_4
Why do we use log probability?
- If you implement this with raw probabilities, you are multiplying many numbers that are all much less than 1, and the product becomes a tiny number, which can cause numerical underflow: the result is too small for the floating-point representation in your computer to store accurately. So in practice, instead of maximizing this product, we take logs, which is more numerically stable. The log of a product becomes a sum of logs, and maximizing this sum of log probabilities gives the same result in terms of selecting the most likely sentence, because the log function is strictly monotonically increasing.
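A small numeric illustration of the underflow argument (toy probabilities, not model outputs):
import math

probs = [0.1] * 400                         # 400 tokens, each with probability 0.1

product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- 1e-400 underflows double precision

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                              # about -921.0, perfectly representable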
Why do we use length normalization?
- If you have a very long sentence, its probability is going to be low because you are multiplying together many terms that are each less than 1, so you tend to end up with a smaller probability. Both objectives (the product form and the log form) therefore have the undesirable effect of unnaturally preferring very short sentences: the probability of a short sentence is obtained by multiplying fewer of these numbers less than 1, so the product is not quite as small. The same is true for the log of a probability: since the probability is always less than or equal to 1, each log term is less than or equal to 0 (the range marked by the blue circle in the slide), so the more terms you add, the more negative the sum. One other change that makes the algorithm work better is to normalize by the number of words in the output sentence, which takes the average of the log probability of each word. This significantly reduces the penalty for outputting longer sentences.
In practice, instead of dividing by T_y, the number of words in the output sentence, we use a softer approach and divide by T_y to the power of alpha, where 0 <= alpha <= 1. If alpha = 1, you normalize fully by length; if alpha = 0, T_y^0 = 1 and you do not normalize at all; any alpha in between gives something between full normalization and no normalization.
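A sketch of that length-normalized score (alpha = 0.7 below is just an example value, not one prescribed by the course or the repo):
import math

def normalized_score(token_logprobs, alpha=0.7):
    # Sum of per-token log-probabilities divided by T_y ** alpha.
    t_y = len(token_logprobs)
    return sum(token_logprobs) / (t_y ** alpha)

short_hyp = [math.log(0.4)] * 5             # 5-token hypothesis
long_hyp = [math.log(0.4)] * 15             # 15-token hypothesis, same per-token probability

print(normalized_score(short_hyp, alpha=0.0), normalized_score(long_hyp, alpha=0.0))
# alpha = 0: raw sums (-4.58 vs -13.74), the short sentence wins easily
print(normalized_score(short_hyp, alpha=1.0), normalized_score(long_hyp, alpha=1.0))
# alpha = 1: average log-prob per word (-0.92 for both), the length penalty disappears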
C5W3L05 Error Analysis of Beam Search https://www.youtube.com/watch?v=ZGUZwk7xIwk
With error analysis you can figure out whether it is the beam search algorithm that is causing problems and is worth spending time on, or whether it is your RNN model that is causing problems and deserves the time instead.
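The decision rule from that video, as a hedged sketch (the log-probabilities are assumed to be computed elsewhere by scoring the human reference y* and the beam output y_hat with the same model):
def attribute_error(logprob_reference: float, logprob_beam_output: float) -> str:
    # Compare log P(y* | x) with log P(y_hat | x) under the current model.
    if logprob_reference > logprob_beam_output:
        # The model actually prefers y*, but beam search failed to find it: search is at fault.
        return "beam search at fault (e.g. try a larger beam width)"
    # Beam search found a sequence the model scores at least as high as y*: the model is at fault.
    return "model at fault"

print(attribute_error(-12.3, -15.8))        # example numbers only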