This relatively easy post is an opportunity to warm up by getting back to the basics of the Transformer architecture and of text generation with Transformer-based decoders. Most importantly, it establishes the vocabulary I will use throughout the series; I highlight in bold the terms I personally favor. In particular, you will learn about the two phases of text generation: the initiation phase and the generation (or decoding) phase.
First, a little Transformer refresher. For simplicity, let’s assume that we process a single sequence at a time (i.e. a batch size of 1). The figure below (Figure 1) outlines the main layers of a vanilla Transformer-based decoder used to generate an output token from a sequence of input tokens.
Figure 1 — Outline of a Transformer decoder model
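To make these two phases concrete, here is a minimal sketch of naive autoregressive generation using the Hugging Face transformers library. The gpt2 checkpoint and the greedy argmax pick are illustrative assumptions on my part, not something the figure prescribes:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Transformer architecture"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # shape: (1, prompt_len)

with torch.no_grad():
    # Initiation phase: a single forward pass over the whole prompt.
    logits = model(input_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick

    # Generation (decoding) phase: one new token per iteration.
    # Note that this naive loop re-processes the full sequence at every step.
    for _ in range(10):
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        logits = model(input_ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(input_ids[0]))
```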
Notice that the decoder itself does not output tokens but logits (as many as the vocabulary size). Incidentally, the last layer that outputs the logits is often called the language model head, or LM head. Deriving a token from the logits is the job of a heuristic called the (token) search strategy, generation strategy, or decoding strategy. Common decoding strategies include greedy search, beam search, and sampling-based methods such as top-k and nucleus (top-p) sampling.
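As a rough illustration of how a decoding strategy turns logits into a token, here is a sketch of greedy search and top-k sampling; the 5-token toy vocabulary and the logit values are made up for the example:

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    # Greedy search: always pick the single most likely token.
    return int(logits.argmax())

def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    # Top-k sampling: keep the k largest logits, renormalize, then sample.
    values, indices = torch.topk(logits, k)
    probs = torch.softmax(values / temperature, dim=-1)
    return int(indices[torch.multinomial(probs, num_samples=1)])

# Toy logits over a 5-token vocabulary (illustrative values only).
logits = torch.tensor([1.0, 3.5, 0.2, 2.8, -1.0])
print(greedy(logits))             # always token id 1
print(top_k_sample(logits, k=3))  # one of the 3 most likely token ids
```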