Chapter 3 N-gram Language Models
Speech and Language Processing (3rd ed.) reading notes
Probabilities are essential in
- any task in which we have to identify words in noisy, ambiguous input, like speech recognition or handwriting recognition.
- spelling correction, which needs to find and correct spelling errors
- machine translation
- augmentative communication systems
Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. A model built on such n-grams is thus called an n-gram model.
An n-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your” or “turn your homework”.
3.1 N-Grams
Let’s begin with the task of computing $P(w|h)$, the probability of a word $w$ given some history $h$. Suppose the history $h$ is “its water is so transparent that” and we want to know the probability that the next word is the:
$$P(the|its\ water\ is\ so\ transparent\ that)$$
One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the. This answers the question “out of the times we saw the history $h$, how many times was it followed by the word $w$”:
$$P(the|its\ water\ is\ so\ transparent\ that)=\frac{C(its\ water\ is\ so\ transparent\ that\ the)}{C(its\ water\ is\ so\ transparent\ that)}$$
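As a concrete (if naive) illustration of this relative-frequency estimate, here is a minimal Python sketch; the toy corpus and the helper names (`count_sequence`, `relative_frequency`) are my own assumptions, not from the book:

```python
def count_sequence(tokens, sequence):
    """Count how many times `sequence` (a tuple of words) occurs in `tokens`."""
    n = len(sequence)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == sequence)

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    numerator = count_sequence(tokens, history + (word,))
    denominator = count_sequence(tokens, history)
    return numerator / denominator if denominator > 0 else 0.0

# Tiny toy corpus; in practice we would count over a very large corpus.
corpus = "its water is so transparent that the fish are visible".split()
history = ("its", "water", "is", "so", "transparent", "that")
print(relative_frequency(corpus, history, "the"))  # 1.0 on this tiny corpus
```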
Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking “out of all possible sequences of five words, how many of them are its water is so transparent?” We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences; that seems rather a lot to estimate! Language is creative, and even a corpus as large as the web is not big enough to give us good counts for most longer sequences.
For this reason, we’ll need to introduce cleverer ways of estimating the probability of a word $w$ given a history $h$, or the probability of an entire word sequence $W$. Let’s start with a little formalizing of notation. To represent the probability of a particular random variable $X_i$ taking on the value “the”, or $P(X_i = \text{"the"})$, we will use the simplification $P(the)$. We’ll represent a sequence of $N$ words either as $w_1 \ldots w_n$ or $w_1^n$ (so the expression $w_1^{n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$). For the joint probability of each word in a sequence having a particular value $P(X = w_1, Y = w_2, Z = w_3, \ldots, W = w_n)$ we’ll use $P(w_1, w_2, \ldots, w_n)$.
Now how can we compute probabilities of entire sequences like $P(w_1, w_2, \ldots, w_n)$? One thing we can do is decompose this probability using the chain rule of probability:
$$P(X_1X_2\ldots X_n)=P(X_1)P(X_2|X_1)P(X_3|X_1^2)\ldots P(X_n|X_1^{n-1})=\prod_{k=1}^{n}P(X_k|X_1^{k-1})$$
Applying the chain rule to words, we get
$$P(w_1^n)=P(w_1)P(w_2|w_1)P(w_3|w_1^2)\ldots P(w_n|w_1^{n-1})=\prod_{k=1}^{n}P(w_k|w_1^{k-1})$$
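As a small worked example (my own, not from the chapter), applying the chain rule to the four-word sequence “its water is so” gives

$$P(its\ water\ is\ so)=P(its)P(water|its)P(is|its\ water)P(so|its\ water\ is)$$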
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given all the previous words $P(w_n|w_1^{n-1})$ by using only the conditional probability of the preceding word $P(w_n|w_{n-1})$:
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-1})$$
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks N−1 words into the past). The general equation for this n-gram approximation to the conditional probability of the next word in a sequence is
$$P(w_n|w_1^{n-1})\approx P(w_n|w_{n-N+1}^{n-1})$$
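In code, this approximation simply means truncating the history to its last N−1 words before looking up or estimating a probability. A minimal illustrative helper (the function name is my own):

```python
def truncate_history(history, N):
    """Keep only the last N-1 words of the history, per the n-gram approximation."""
    return tuple(history[-(N - 1):]) if N > 1 else ()

history = ["its", "water", "is", "so", "transparent", "that"]
print(truncate_history(history, 3))  # ('transparent', 'that'): a trigram conditions on the last two words
print(truncate_history(history, 2))  # ('that',): a bigram conditions only on the previous word
```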
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence
$$P(w_1^n)=\prod_{k=1}^{n}P(w_k|w_{k-1})$$
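For instance, a bigram model can score a whole sentence by multiplying these conditional probabilities (or, to avoid numerical underflow, summing their logs). Here is a minimal sketch; the `<s>` start symbol and the probability values are illustrative placeholders, and estimating the table itself is sketched after the MLE equation below:

```python
import math

# Hypothetical bigram probabilities P(w_k | w_{k-1}); a real table would be
# estimated from a corpus (see the MLE sketch below).
bigram_prob = {
    ("<s>", "its"): 0.25,
    ("its", "water"): 0.5,
    ("water", "is"): 0.4,
    ("is", "so"): 0.1,
    ("so", "transparent"): 0.05,
}

def sentence_logprob(words, bigram_prob):
    """Sum log P(w_k | w_{k-1}) over the sentence (log space avoids underflow)."""
    padded = ["<s>"] + words
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = bigram_prob.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram gets zero probability under this model
        logp += math.log(p)
    return logp

print(sentence_logprob("its water is so transparent".split(), bigram_prob))
```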
How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1.
For example, to compute a particular bigram probability of a word $y$ given a previous word $x$, we’ll compute the count of the bigram $C(xy)$ and normalize by the sum of all the bigrams that share the same first word $x$:
$$P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum_{w}C(w_{n-1}w)}$$
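A minimal sketch of this MLE estimate on a tiny padded corpus (the `<s>`/`</s>` sentence markers follow the chapter's convention; the variable and function names are my own):

```python
from collections import Counter

# Tiny toy training corpus; a real model would be trained on far more text.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
for sent in sentences:
    tokens = sent.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev word) / sum over w of C(prev w)."""
    total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return bigram_counts[(prev, word)] / total if total else 0.0

print(mle_bigram("<s>", "I"))    # 2/3
print(mle_bigram("I", "am"))     # 2/3
print(mle_bigram("Sam", "</s>")) # 1/2
```

Note that for text padded with `</s>`, the sum in the denominator is simply the unigram count of the preceding word, a simplification the textbook makes next.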