- transformer.py
class Transformer(CaptioningModel):
- captioning_model.py
class CaptioningModel(Module):
    def beam_search(self, visual: utils.TensorOrSequence, max_len: int, eos_idx: int, beam_size: int, out_size=1,
                    return_probs=False, **kwargs):
        bs = BeamSearch(self, max_len, eos_idx, beam_size)
        return bs.apply(visual, out_size, return_probs, **kwargs)
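For orientation, a minimal usage sketch of this wrapper (the feature shape, max_len, eos_idx and beam_size values below are illustrative assumptions, not values from the repo):
import torch

def generate_captions(model, visual: torch.Tensor):
    # Hypothetical helper, assuming `visual` is a batch of precomputed region features,
    # e.g. (batch_size, num_regions, feat_dim). beam_search keeps `beam_size` hypotheses
    # per image and returns the best `out_size` sequence(s) per image.
    return model.beam_search(visual, max_len=20, eos_idx=3, beam_size=5,
                             out_size=1, return_probs=False)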
- beam_search.py
def apply(self, visual: utils.TensorOrSequence, out_size=1, return_probs=False, **kwargs):
    # (the earlier lines of apply, which initialize outputs and the running beam state, are omitted in these notes)
    with self.model.statefulness(self.b_s):
        for t in range(self.max_len):
            visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
- containers.py
@contextmanager
def statefulness(self, batch_size: int):
    self.enable_statefulness(batch_size)
    try:
        yield
    finally:
        self.disable_statefulness()
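The contextmanager guarantees that disable_statefulness runs even if decoding throws. A toy class (not from the repo) showing the same pattern:
from contextlib import contextmanager

class ToyStatefulModule:
    # Minimal stand-in for containers.Module, only to demonstrate the enable/disable pattern.
    def __init__(self):
        self.is_stateful = False

    def enable_statefulness(self, batch_size: int):
        # The real Module allocates per-sample running states here (e.g. cached decoder states).
        self.is_stateful = True
        self.batch_size = batch_size

    def disable_statefulness(self):
        self.is_stateful = False

    @contextmanager
    def statefulness(self, batch_size: int):
        self.enable_statefulness(batch_size)
        try:
            yield
        finally:
            # Runs on normal exit and on exceptions alike.
            self.disable_statefulness()

m = ToyStatefulModule()
with m.statefulness(batch_size=4):
    assert m.is_stateful
assert not m.is_stateful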
def iter(self, t: int, visual: utils.TensorOrSequence, outputs, return_probs, **kwargs):
    cur_beam_size = 1 if t == 0 else self.beam_size
    word_logprob = self.model.step(t, self.selected_words, visual, None, mode='feedback', **kwargs)
    # ((batch_size * beam_size), cur_len, vocab_size): word_logprob is the output of the decoder.
    word_logprob = word_logprob.view(self.b_s, cur_beam_size, -1)
    # (batch_size, cur_beam_size, vocab_size): the token log-probabilities of the last time step,
    # i.e. the conditional probability of each candidate next word.
    candidate_logprob = self.seq_logprob + word_logprob
    # (batch_size, cur_beam_size, vocab_size): cumulative log-probability of each candidate
    # sequence; because we work with logs, the probabilities can simply be added.
't' denotes the time step. When t is 0, each sentence always begins with the special token 'bos', so there is only one hypothesis per image and cur_beam_size = 1.
cur_beam_size = 1 if t == 0 else self.beam_size
is the same as:
if t == 0:
    cur_beam_size = 1
else:
    cur_beam_size = self.beam_size
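To make the shapes and the log-space addition concrete, here is a standalone check with toy sizes (the top-k selection at the end is a simplification; the actual selection logic lives further down in beam_search.py):
import torch

b_s, beam_size, vocab_size = 2, 3, 7
seq_logprob = torch.zeros(b_s, beam_size, 1)                 # cumulative log-prob of each kept beam
word_logprob = torch.log_softmax(torch.randn(b_s, beam_size, vocab_size), dim=-1)

# log P(y_1..y_t) = log P(y_1..y_{t-1}) + log P(y_t | y_1..y_{t-1});
# broadcasting (b_s, beam_size, 1) + (b_s, beam_size, vocab_size) adds the running score
# of each beam to every possible next word.
candidate_logprob = seq_logprob + word_logprob               # (b_s, beam_size, vocab_size)

# Keep the beam_size best (beam, word) pairs per image.
selected_logprob, selected_idx = candidate_logprob.view(b_s, -1).topk(beam_size, dim=-1)
print(candidate_logprob.shape, selected_idx.shape)           # (2, 3, 7) and (2, 3)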
def step(self, t, prev_output, visual, seq, mode='teacher_forcing', **kwargs):
    it = None
    if mode == 'teacher_forcing':
        raise NotImplementedError
    elif mode == 'feedback':
        if t == 0:
            self.enc_output, self.mask_enc = self.encoder(visual)
            if isinstance(visual, torch.Tensor):
                it = visual.data.new_full((visual.shape[0], 1), self.bos_idx).long()
            else:
                it = visual[0].data.new_full((visual[0].shape[0], 1), self.bos_idx).long()
        else:
            it = prev_output
    return self.decoder(it, self.enc_output, self.mask_enc)
If t == 0, every value in 'it' is self.bos_idx, which is 2 here. For any later time step, 'it' holds the words selected at the previous step by beam search.
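A quick standalone check of the t == 0 branch (the feature shape is an assumption; bos_idx = 2 as stated above): new_full builds a (batch_size, 1) tensor filled with the BOS index on the same device as visual, and .long() casts it to integer token indices.
import torch

visual = torch.randn(4, 50, 2048)       # assumed (batch_size, num_regions, feat_dim) features
bos_idx = 2
it = visual.data.new_full((visual.shape[0], 1), bos_idx).long()
print(it)                               # tensor([[2], [2], [2], [2]])
print(it.shape)                         # torch.Size([4, 1])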
C5W3L04 Refining Beam Search https://www.youtube.com/watch?v=gb__z7LlN_4
Why do we use log probability?
- If you implement this with raw probabilities, you are multiplying many numbers that are all much less than 1, and the product becomes a tiny number, which can cause numerical underflow: the result is too small for the floating-point representation in your computer to store accurately. So in practice, instead of maximizing this product, we take logs, which is more numerically stable. The log of a product becomes a sum of logs, and maximizing this sum of log probabilities gives the same result in terms of selecting the most likely sentence, because the log function is strictly monotonically increasing.
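A small numeric illustration of the underflow argument (toy probabilities, not model outputs):
import math

probs = [0.1] * 400                         # 400 tokens, each with probability 0.1

product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- 1e-400 underflows double precision

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                              # about -921.0, perfectly representable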
Why do we use length normalization?
- If you have a very long sentence, its probability is going to be low because you are multiplying together many terms that are each less than 1, so you tend to end up with a smaller probability. Both objectives (the product form and the log form) therefore have the undesirable effect of unnaturally preferring very short sentences: the probability of a short sentence is obtained by multiplying fewer of these numbers less than 1, so the product is not quite as small. The same is true for the log of a probability: since the probability is always less than or equal to 1, each log term is less than or equal to 0 (the range marked by the blue circle in the slide), so the more terms you add, the more negative the sum. One other change that makes the algorithm work better is to normalize by the number of words in the output sentence, which takes the average of the log probability of each word. This significantly reduces the penalty for outputting longer sentences.
In practice, instead of dividing by T_y, the number of words in the output sentence, we use a softer approach and divide by T_y to the power of alpha, where 0 <= alpha <= 1. If alpha = 1, you normalize fully by length; if alpha = 0, T_y^0 = 1 and you do not normalize at all; any alpha in between gives something between full normalization and no normalization.
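A sketch of that length-normalized score (alpha = 0.7 below is just an example value, not one prescribed by the course or the repo):
import math

def normalized_score(token_logprobs, alpha=0.7):
    # Sum of per-token log-probabilities divided by T_y ** alpha.
    t_y = len(token_logprobs)
    return sum(token_logprobs) / (t_y ** alpha)

short_hyp = [math.log(0.4)] * 5             # 5-token hypothesis
long_hyp = [math.log(0.4)] * 15             # 15-token hypothesis, same per-token probability

print(normalized_score(short_hyp, alpha=0.0), normalized_score(long_hyp, alpha=0.0))
# alpha = 0: raw sums (-4.58 vs -13.74), the short sentence wins easily
print(normalized_score(short_hyp, alpha=1.0), normalized_score(long_hyp, alpha=1.0))
# alpha = 1: average log-prob per word (-0.92 for both), the length penalty disappears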
C5W3L05 Error Analysis of Beam Search https://www.youtube.com/watch?v=ZGUZwk7xIwk
With error analysis you can figure out whether it is the beam search algorithm that is causing problems and is worth spending time on, or whether it is your RNN model that is causing problems and deserves the time instead.
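The decision rule from that video, as a hedged sketch (the log-probabilities are assumed to be computed elsewhere by scoring the human reference y* and the beam output y_hat with the same model):
def attribute_error(logprob_reference: float, logprob_beam_output: float) -> str:
    # Compare log P(y* | x) with log P(y_hat | x) under the current model.
    if logprob_reference > logprob_beam_output:
        # The model actually prefers y*, but beam search failed to find it: search is at fault.
        return "beam search at fault (e.g. try a larger beam width)"
    # Beam search found a sequence the model scores at least as high as y*: the model is at fault.
    return "model at fault"

print(attribute_error(-12.3, -15.8))        # example numbers only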