This relatively easy post is an opportunity to warm up by getting back to the basics of the Transformer architecture and of text generation with Transformer-based decoders. Most importantly, it establishes the vocabulary I will use throughout the series; I highlight in bold the terms I personally favor. In particular, you will learn about the two phases of text generation: the initiation phase and the generation (or decoding) phase.
First, a little Transformer refresher. For simplicity, let’s assume that we process a single sequence at a time (i.e. a batch size of 1). The figure below (Figure 1) outlines the main layers of a vanilla Transformer-based decoder used to generate an output token from a sequence of input tokens.
Figure 1 — Outline of a Transformer decoder model
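To make these two phases concrete, here is a minimal sketch of naive autoregressive generation using the Hugging Face transformers library. The gpt2 checkpoint and the greedy argmax pick are illustrative assumptions on my part, not something the figure prescribes:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Transformer architecture"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # shape: (1, prompt_len)

with torch.no_grad():
    # Initiation phase: a single forward pass over the whole prompt.
    logits = model(input_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick

    # Generation (decoding) phase: one new token per iteration.
    # Note that this naive loop re-processes the full sequence at every step.
    for _ in range(10):
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        logits = model(input_ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(input_ids[0]))
```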
Notice that the decoder itself does not output tokens but logits (as many as the vocabulary size). Incidentally, the last layer that outputs the logits is often called the language model head, or LM head. Deriving a token from the logits is the job of a heuristic called the (token) search strategy, generation strategy, or decoding strategy. Common decoding strategies include greedy search, beam search, and sampling-based methods such as top-k and nucleus (top-p) sampling.
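As a rough illustration of how a decoding strategy turns logits into a token, here is a sketch of greedy search and top-k sampling; the 5-token toy vocabulary and the logit values are made up for the example:

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    # Greedy search: always pick the single most likely token.
    return int(logits.argmax())

def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    # Top-k sampling: keep the k largest logits, renormalize, then sample.
    values, indices = torch.topk(logits, k)
    probs = torch.softmax(values / temperature, dim=-1)
    return int(indices[torch.multinomial(probs, num_samples=1)])

# Toy logits over a 5-token vocabulary (illustrative values only).
logits = torch.tensor([1.0, 3.5, 0.2, 2.8, -1.0])
print(greedy(logits))             # always token id 1
print(top_k_sample(logits, k=3))  # one of the 3 most likely token ids
```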