Understanding Language Models: From Unigrams to N-grams

Article link: https://blog.youkuaiyun.com/chenjunheaixuexi/article/details/125525498

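This article introduces the basic concepts of language models, shows how n-grams are used to estimate joint probabilities, and covers applications such as pretrained models, text generation, and sequence comparison. It uses the Markov assumption to simplify the computation for long sequences and works through the unigram and bigram calculations.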


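Given a text sequence $x_1,...,x_T$, the goal of a language model is to estimate the joint probability $p(x_1,...,x_T)$.
Its applications include:
- Pretraining models (BERT, GPT-3)
- Generating text: given the first few tokens, repeatedly sample $x_t\sim p(x_t|x_1,...,x_{t-1})$ to produce the continuation
- Judging which of several sequences is more likely (choosing between similar-sounding phrases in speech recognition, autocomplete suggestions while typing, and error correction)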
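The sampling loop in the text-generation item can be sketched as follows. This is a minimal illustration and not a real model: `TOY_MODEL`, `next_token_probs`, `generate`, and the `<eos>` end marker are hypothetical names introduced here, the probabilities are made up, and for brevity the toy conditions only on the last token, whereas a real language model would use the whole prefix.

```python
import random

# Hypothetical toy model with made-up probabilities (an assumption for illustration).
# It conditions on the last token only; a real model would use the whole prefix.
TOY_MODEL = {
    "the": {"time": 0.6, "machine": 0.4},
    "time": {"machine": 0.7, "traveller": 0.3},
    "machine": {"<eos>": 1.0},
    "traveller": {"<eos>": 1.0},
}


def next_token_probs(prefix):
    """Return p(x_t | x_1, ..., x_{t-1}) as a dict (toy: looks at prefix[-1] only)."""
    return TOY_MODEL[prefix[-1]]


def generate(prefix, max_new_tokens, rng):
    """Extend prefix by repeatedly sampling x_t ~ p(x_t | x_1, ..., x_{t-1})."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<eos>":  # hypothetical end-of-sequence marker
            break
        tokens.append(token)
    return tokens


sample = generate(["the"], max_new_tokens=5, rng=random.Random(0))
print(sample)
```

Each sampled token is appended to the prefix before the next draw, which is what makes the generation autoregressive.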

Modeling with counts

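For a sequence of length 2, we estimate $p(x,x^{'})=p(x)p(x^{'}|x)=\frac{n(x)}{n}\frac{n(x,x^{'})}{n(x)}$
- where $n$ is the total number of tokens in the corpus, $n(x)$ is the number of times token $x$ occurs, and $n(x,x^{'})$ is the number of times $x$ is immediately followed by $x^{'}$ (a count, not a probability)
This extends easily to a sequence of length 3:
$p(x,x^{'},x^{''})=p(x)p(x^{'}|x)p(x^{''}|x,x^{'})=\frac{n(x)}{n}\frac{n(x,x^{'})}{n(x)}\frac{n(x,x^{'},x^{''})}{n(x,x^{'})}$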
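As a rough illustration of the length-2 estimate, the sketch below computes $n$, $n(x)$ and $n(x,x^{'})$ with `collections.Counter` on a tiny corpus. The corpus and the helper name `p_pair` are assumptions made up for this example and do not come from the original article.

```python
from collections import Counter

# Toy corpus, made up for illustration (an assumption, not the article's data)
corpus = "the time machine the time traveller the medical man".split()

n = len(corpus)                                         # total number of tokens
unigram_counts = Counter(corpus)                        # n(x)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))   # n(x, x')


def p_pair(x, x_next):
    """Estimate p(x, x') = n(x)/n * n(x, x')/n(x) from the counts."""
    if unigram_counts[x] == 0:
        return 0.0  # x never occurs, so the pair cannot occur either
    p_x = unigram_counts[x] / n
    p_next_given_x = bigram_counts[(x, x_next)] / unigram_counts[x]
    return p_x * p_next_given_x


print(p_pair("the", "time"))
```

In this toy corpus "the" occurs 3 times out of 9 tokens and is followed by "time" twice, so the estimate is $\frac{3}{9}\cdot\frac{2}{3}=\frac{2}{9}$.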

N-grams

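When the sequence is long, these counts become impractical: a long sequence rarely appears in the corpus even once, so its count is close to zero and the estimate is unreliable. The Markov assumption mitigates this by letting each token depend only on a fixed number of preceding tokens.
Here we assume a sequence of length 4:
- Unigram (all tokens are assumed to occur independently of one another): $p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2)p(x_3)p(x_4)=\frac{n(x_1)}{n}\frac{n(x_2)}{n}\frac{n(x_3)}{n}\frac{n(x_4)}{n}$
- Bigram (each token depends only on the token before it): $p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_3)=\frac{n(x_1)}{n}\frac{n(x_1,x_2)}{n(x_1)}\frac{n(x_2,x_3)}{n(x_2)}\frac{n(x_3,x_4)}{n(x_3)}$
- The same pattern extends to N-grams, where each token depends on the previous $N-1$ tokens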
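To make the difference between the two assumptions concrete, here is a small sketch that scores a sequence under the unigram and bigram factorizations. The toy corpus and the function names `unigram_prob` and `bigram_prob` are assumptions introduced for illustration, and no smoothing is applied, so any unseen pair gives probability zero.

```python
from collections import Counter

# Toy corpus, made up for illustration (an assumption, not the article's data)
corpus = "the time machine the time traveller the medical man".split()
n = len(corpus)
unigram_counts = Counter(corpus)                        # n(x)
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))   # n(x_{t-1}, x_t)


def unigram_prob(seq):
    """p(x_1, ..., x_T) as a product of n(x_t)/n: tokens treated as independent."""
    p = 1.0
    for x in seq:
        p *= unigram_counts[x] / n
    return p


def bigram_prob(seq):
    """p(x_1) times the product of n(x_{t-1}, x_t)/n(x_{t-1}): first-order Markov."""
    p = unigram_counts[seq[0]] / n
    for prev, cur in zip(seq[:-1], seq[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen context; no smoothing in this sketch
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p


in_order = ["the", "time", "machine"]
shuffled = ["machine", "time", "the"]
print(unigram_prob(in_order), unigram_prob(shuffled))
print(bigram_prob(in_order), bigram_prob(shuffled))
```

The unigram model cannot tell the two orderings apart, because it multiplies the same three factors either way. The bigram model gives the in-order sequence $\frac{3}{9}\cdot\frac{2}{3}\cdot\frac{1}{2}=\frac{1}{9}$ and the shuffled one zero, since "machine time" never occurs in the toy corpus. This is the kind of comparison used to decide which of several candidate sequences is more plausible.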

Summary

A language model estimates the joint probability of a text sequence.
Statistical approaches commonly use n-grams.

import random
import torch
from d2l import torch as d2l

tokens = d2l.tokenize(d2l.read_time_machine())
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)
vocab.token_freqs[:10]


[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]

# Collect the frequency of each individual token (already sorted from most to least frequent)
freqs = [freq for token, freq in vocab.token_freqs]
d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
         xscale='log', yscale='log')

[Figure: log-log plot of token frequency n(x) against token x]

bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
bigram_vocab = d2l.Vocab(bigram_tokens)
bigram_vocab.token_freqs[:10]

[(('of', 'the'), 309),
 (('in', 'the'), 169),
 (('i', 'had'), 130),
 (('i', 'was'), 112),
 (('and', 'the'), 109),
 (('the', 'time'), 102),
 (('it', 'was'), 99),
 (('to', 'the'), 85),
 (('as', 'i'), 78),
 (('of', 'a'), 73)]

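# Stop words have not been removed here. From trigrams onward, the frequent
# entries would all carry concrete meaning if stop words were stripped out.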
trigram_tokens = [triple for triple in zip(
    corpus[:-2], corpus[1:-1], corpus[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)
trigram_vocab.token_freqs[:10]

[(('the', 'time', 'traveller'), 59),
 (('the', 'time', 'machine'), 30),
 (('the', 'medical', 'man'), 24),
 (('it', 'seemed', 'to'), 16),
 (('it', 'was', 'a'), 15),
 (('here', 'and', 'there'), 15),
 (('seemed', 'to', 'me'), 14),
 (('i', 'did', 'not'), 14),
 (('i', 'saw', 'the'), 13),
 (('i', 'began', 'to'), 13)]

bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])
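The bigram and trigram lists above are built the same way, by zipping shifted copies of the corpus. A possible generalization to any `n`, using only the standard library, is sketched below; `ngram_counts` and the toy token list are hypothetical and are not part of `d2l`.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every window of n consecutive tokens (hypothetical helper, not a d2l function)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip stops at the shortest shifted copy, so only complete windows are produced
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)


# Toy token list, made up for illustration
toy = "the time machine the time traveller".split()
print(ngram_counts(toy, 2).most_common(1))
print(ngram_counts(toy, 3).most_common(1))
```

With `n=1` the keys are 1-tuples rather than plain strings, which differs from how the unigram vocabulary is keyed in the code above.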