Chapter 3 N-gram Language Models
Speech and Language Processing (3rd ed.) reading notes
Probabilities are essential in
- any task in which we have to identify words in noisy, ambiguous input, like speech recognition or handwriting recognition.
- spelling correction, which needs to find and correct spelling errors
- machine translation
- augmentative communication systems
Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. A model built on such n-grams is thus called an n-gram model.
An n-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your” or “turn your homework”.
3.1 N-Grams
Let’s begin with the task of computing $P(w|h)$, the probability of a word $w$ given some history $h$. Suppose the history $h$ is “its water is so transparent that” and we want to know the probability that the next word is the:
$$P(the|its\ water\ is\ so\ transparent\ that)$$
One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the. This answers the question “out of the times we saw the history $h$, how many times was it followed by the word $w$”:
$$P(the|its\ water\ is\ so\ transparent\ that)=\frac{C(its\ water\ is\ so\ transparent\ that\ the)}{C(its\ water\ is\ so\ transparent\ that)}$$
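As a concrete (if naive) illustration of this relative-frequency estimate, here is a minimal Python sketch; the toy corpus and the helper names (`count_sequence`, `relative_frequency`) are my own assumptions, not from the book:

```python
def count_sequence(tokens, sequence):
    """Count how many times `sequence` (a tuple of words) occurs in `tokens`."""
    n = len(sequence)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == sequence)

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    numerator = count_sequence(tokens, history + (word,))
    denominator = count_sequence(tokens, history)
    return numerator / denominator if denominator > 0 else 0.0

# Tiny toy corpus; in practice we would count over a very large corpus.
corpus = "its water is so transparent that the fish are visible".split()
history = ("its", "water", "is", "so", "transparent", "that")
print(relative_frequency(corpus, history, "the"))  # 1.0 on this tiny corpus
```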
Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences; that seems rather a lot to estimate! Language is creative, and even a corpus as large as the web is not big enough to give us good counts for most longer sequences.
For this reason, we’ll need to introduce cleverer ways of estimating the probability of a word $w$ given a history $h$, or the probability of an entire word sequence $W$. Let’s start with a little formalizing of notation. To represent the probability of a particular random variable $X_i$ taking on the value “the”, or $P(X_i = \text{"the"})$, we will use the simplification $P(the)$. We’ll represent a sequence of $N$ words either as $w_1 \ldots w_n$ or $w_1^n$ (so the expression $w_1^{n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$). For the joint probability of each word in a sequence having a particular value $P(X = w_1, Y = w_2, Z = w_3, \ldots, W = w_n)$ we’ll use $P(w_1, w_2, \ldots, w_n)$.
Now how can we compute probabilities of entire sequences like $P(w_1, w_2, \ldots, w_n)$? One thing we can do is decompose this probability using the chain rule of probability:
$$P(X_1X_2\ldots X_n)=P(X_1)P(X_2|X_1)P(X_3|X_1^2)\ldots P(X_n|X_1^{n-1})=\prod_{k=1}^{n}P(X_k|X_1^{k-1})$$
Applying the chain rule to words, we get
$$P(w_1^n)=P(w_1)P(w_2|w_1)P(w_3|w_1^2)\ldots P(w_n|w_1^{n-1})=\prod_{k=1}^{n}P(w_k|w_1^{k-1})$$
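As a small worked example (my own, not from the chapter), applying the chain rule to the four-word sequence “its water is so” gives

$$P(its\ water\ is\ so)=P(its)P(water|its)P(is|its\ water)P(so|its\ water\ is)$$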
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the previous words $P(w_n|w_1^{n-1})$ by using only the conditional probability of the preceding word $P(w_n|w_{n-1})$:
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-1})$$
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks N−1 words into the past). The general equation for this n-gram approximation to the conditional probability of the next word in a sequence is
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-N+1}^{n-1})$$
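In code, this approximation simply means truncating the history to its last N−1 words before looking up or estimating a probability. A minimal illustrative helper (the function name is my own):

```python
def truncate_history(history, N):
    """Keep only the last N-1 words of the history, per the n-gram approximation."""
    return tuple(history[-(N - 1):]) if N > 1 else ()

history = ["its", "water", "is", "so", "transparent", "that"]
print(truncate_history(history, 3))  # ('transparent', 'that'): a trigram conditions on the last two words
print(truncate_history(history, 2))  # ('that',): a bigram conditions only on the previous word
```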
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence
$$P(w_1^n)=\prod_{k=1}^{n}P(w_k|w_{k-1})$$
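For instance, a bigram model can score a whole sentence by multiplying these conditional probabilities (or, to avoid numerical underflow, summing their logs). Here is a minimal sketch; the `<s>` start symbol and the probability values are illustrative placeholders, and estimating the table itself is sketched after the MLE equation below:

```python
import math

# Hypothetical bigram probabilities P(w_k | w_{k-1}); a real table would be
# estimated from a corpus (see the MLE sketch below).
bigram_prob = {
    ("<s>", "its"): 0.25,
    ("its", "water"): 0.5,
    ("water", "is"): 0.4,
    ("is", "so"): 0.1,
    ("so", "transparent"): 0.05,
}

def sentence_logprob(words, bigram_prob):
    """Sum log P(w_k | w_{k-1}) over the sentence (log space avoids underflow)."""
    padded = ["<s>"] + words
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = bigram_prob.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram gets zero probability under this model
        logp += math.log(p)
    return logp

print(sentence_logprob("its water is so transparent".split(), bigram_prob))
```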
How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1.
For example, to compute a particular bigram probability of a word $y$ given a previous word $x$, we’ll compute the count of the bigram $C(xy)$ and normalize by the sum of all the bigrams that share the same first word $x$:
$$P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum_{w}C(w_{n-1}w)}$$
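A minimal sketch of this MLE estimate on a tiny padded corpus (the `<s>`/`</s>` sentence markers follow the chapter's convention; the variable and function names are my own):

```python
from collections import Counter

# Tiny toy training corpus; a real model would be trained on far more text.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
for sent in sentences:
    tokens = sent.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev word) / sum over w of C(prev w)."""
    total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return bigram_counts[(prev, word)] / total if total else 0.0

print(mle_bigram("<s>", "I"))    # 2/3
print(mle_bigram("I", "am"))     # 2/3
print(mle_bigram("Sam", "</s>")) # 1/2
```

Note that for text padded with `</s>`, the sum in the denominator is simply the unigram count of the preceding word, a simplification the textbook makes next.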