Chapter 3 N-gram Language Models

These are study notes for Chapter 3 of Speech and Language Processing, which introduces N-gram language models and their applications. N-gram models assign probabilities to sentences and word sequences and are the foundation of language modeling. The chapter covers everything from computing N-gram probabilities to smoothing techniques such as Laplace smoothing and Kneser-Ney smoothing, and also discusses perplexity, a key metric for evaluating language models, and its relationship to entropy.



Reading notes for Speech and Language Processing, 3rd edition.

Probabilities are essential in

  • any task in which we have to identify words in noisy, ambiguous input, like speech recognition or handwriting recognition.
  • spelling correction, which needs to find and correct spelling errors
  • machine translation
  • augmentative communication systems

Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. This model is thus called the n-gram model.

An n-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.

3.1 N-Grams

Let’s begin with the task of computing $P(w|h)$, the probability of a word $w$ given some history $h$. Suppose the history $h$ is “its water is so transparent that” and we want to know the probability that the next word is the:
$$P(the \mid its\ water\ is\ so\ transparent\ that)$$
One way to estimate this probability is from relative frequency counts:
$$P(the \mid its\ water\ is\ so\ transparent\ that) = \frac{C(its\ water\ is\ so\ transparent\ that\ the)}{C(its\ water\ is\ so\ transparent\ that)}$$
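To make the relative-frequency idea concrete, here is a minimal Python sketch over an invented toy corpus; the corpus, the function name, and the resulting counts are purely illustrative and not from the book:

```python
# A toy corpus of whitespace-tokenized sentences (invented for illustration).
corpus = [
    "its water is so transparent that the fish are visible".split(),
    "its water is so transparent that you can see the bottom".split(),
    "the water here is not so transparent that the fish hide".split(),
]

def relative_frequency(history, word, sentences):
    """Estimate P(word | history) as C(history word) / C(history),
    counting token-sequence occurrences across the corpus."""
    history = tuple(history)
    target = history + (word,)
    count_history = count_target = 0
    for tokens in sentences:
        for i in range(len(tokens)):
            if tuple(tokens[i:i + len(history)]) == history:
                count_history += 1
            if tuple(tokens[i:i + len(target)]) == target:
                count_target += 1
    return count_target / count_history if count_history else 0.0

print(relative_frequency("its water is so transparent that".split(), "the", corpus))
# 0.5 on this toy corpus: the history occurs twice, followed by "the" once
```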
Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences. That seems rather a lot to estimate!

For this reason, we’ll need to introduce cleverer ways of estimating the probability of a word $w$ given a history $h$, or the probability of an entire word sequence $W$. Let’s start with a little formalizing of notation. To represent the probability of a particular random variable $X_i$ taking on the value “the”, or $P(X_i = \text{“the”})$, we will use the simplification $P(the)$. We’ll represent a sequence of $N$ words either as $w_1 \ldots w_n$ or $w_1^n$ (so the expression $w_1^{n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$). For the joint probability of each word in a sequence having a particular value $P(X = w_1, Y = w_2, Z = w_3, \ldots, W = w_n)$ we’ll use $P(w_1, w_2, \ldots, w_n)$.

Now how can we compute probabilities of entire sequences like $P(w_1, w_2, \ldots, w_n)$? One thing we can do is decompose this probability using the chain rule of probability:
$$P(X_1 X_2 \ldots X_n) = P(X_1)\,P(X_2|X_1)\,P(X_3|X_1^2)\ldots P(X_n|X_1^{n-1}) = \prod_{k=1}^{n} P(X_k|X_1^{k-1})$$
Applying the chain rule to words, we get
$$P(w_1^n) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1^2)\ldots P(w_n|w_1^{n-1}) = \prod_{k=1}^{n} P(w_k|w_1^{k-1})$$
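For example, the chain rule expands the probability of the five-word sequence from above as follows; nothing is approximated yet, since each factor still conditions on the full preceding history:
$$P(its\ water\ is\ so\ transparent) = P(its)\,P(water|its)\,P(is|its\ water)\,P(so|its\ water\ is)\,P(transparent|its\ water\ is\ so)$$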
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

The bigram model, for example, approximates the probability of a word given all the previous words $P(w_n|w_1^{n-1})$ by using only the conditional probability of the preceding word $P(w_n|w_{n-1})$:
$$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-1})$$
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. The general equation for n-gram approximation to the conditional probability of the next word in a sequence is
$$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$$
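For instance, setting $N = 3$ gives the trigram model, which conditions on just the two preceding words:
$$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-2}^{n-1}) = P(w_n|w_{n-2}\,w_{n-1})$$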
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence:
$$P(w_1^n) = \prod_{k=1}^{n} P(w_k|w_{k-1})$$
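As a small illustration of this product, here is a minimal Python sketch that scores a sentence under a bigram model. The bigram table, the probability values, and the `<s>`/`</s>` boundary markers are all assumptions made for the example; summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on long sentences.

```python
import math

# Hypothetical bigram probabilities P(w_k | w_{k-1}); the values are invented.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.30,
    ("want", "chinese"): 0.01,
    ("chinese", "food"): 0.50,
    ("food", "</s>"): 0.70,
}

def sentence_logprob(words, probs):
    """Return sum_k log P(w_k | w_{k-1}) for the sentence, padded with <s>/</s>."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(probs[(prev, cur)])
               for prev, cur in zip(padded, padded[1:]))

logp = sentence_logprob("i want chinese food".split(), bigram_prob)
print(logp, math.exp(logp))  # log probability and the equivalent probability
```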
How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1.

For example, to compute a particular bigram probability of a word $y$ given a previous word $x$, we’ll compute the count of the bigram $C(xy)$ and normalize by the sum of all the bigrams that share the same first word $x$:
$$P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)}$$
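This estimator translates almost directly into code. Below is a minimal Python sketch, assuming a tiny toy corpus whose sentences are padded with `<s>` and `</s>` boundary markers; the corpus, the marker convention, and the function name are illustrative choices rather than anything fixed by the formula itself:

```python
from collections import Counter

# Tiny toy corpus; <s> and </s> mark sentence boundaries.
sentences = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]

# Count each bigram C(prev cur) and each context total sum_w C(prev w).
bigram_counts = Counter()
context_counts = Counter()
for tokens in sentences:
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1  # prev counted once per bigram it starts

def mle_bigram(prev, cur):
    """MLE estimate P(cur | prev) = C(prev cur) / sum_w C(prev w)."""
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(mle_bigram("<s>", "i"))   # 2/3 on this corpus
print(mle_bigram("i", "am"))    # 2/3
print(mle_bigram("am", "sam"))  # 1/2
```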
