What follows is a summary of the HMM material from the book "Speech and Language Processing". I have never felt I understood HMMs this thoroughly before. The book explains everything in speech and natural language processing in a clear, accessible way; it is essential reading for this area. Strongly recommended.
HMMs and MEMMs are both sequence classifiers. A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence.
Hidden Markov Models
Markov chain
We can view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state.
Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$
A Markov chain is specified by the following components:
- $Q = q_1 q_2 \ldots q_N$ : a set of $N$ states
- $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$ : a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
- $q_0, q_F$ : a special start state and end state which are not associated with observations
A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world. A hidden Markov model allows us to talk about both observed events and hidden events that we think of as causal factors in our probabilistic model.
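To make this concrete, here is a minimal Python sketch (not from the book; the weather states and numbers are made up, and the end state is omitted for simplicity) that scores an observed state sequence under a first-order Markov chain:

```python
import numpy as np

# Made-up example: a Markov chain over two weather states.
states = ["HOT", "COLD"]
pi = np.array([0.6, 0.4])      # a_{0j}: transitions out of the start state
A = np.array([[0.7, 0.3],      # a_{ij}: P(q_t = j | q_{t-1} = i), rows sum to 1
              [0.4, 0.6]])

def sequence_probability(seq):
    """P(q_1 ... q_T) = a_{0,q_1} * prod_{t>1} a_{q_{t-1},q_t}."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

print(sequence_probability(["HOT", "HOT", "COLD"]))  # 0.6 * 0.7 * 0.3 = 0.126
```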
Hidden Markov Models
An HMM is specified by the following components:
- $Q = q_1 q_2 \ldots q_N$ : a set of $N$ states
- $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$ : a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
- $O = o_1 o_2 \ldots o_T$ : a sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$
- $B = b_i(o_t)$ : a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
- $q_0, q_F$ : a special start state and end state which are not associated with observations, together with transition probabilities $a_{01} a_{02} \ldots a_{0n}$ out of the start state and $a_{1F} a_{2F} \ldots a_{nF}$ into the end state
A first-order Hidden Markov Model instantiates two simplifying assumptions:
- Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$. As with a first-order Markov chain, the probability of a particular state depends only on the previous state.
- Output Independence Assumption: $P(o_i \mid q_1 \ldots q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i)$. The probability of an output observation $o_i$ depends only on the state that produced the observation, $q_i$, and not on any other states or any other observations.
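Taken together, these two assumptions let the joint probability of a state sequence $Q$ and an observation sequence $O$ factor into a product of transition and emission probabilities:

$P(O, Q \mid \lambda) = \prod_{i=1}^{T} P(q_i \mid q_{i-1}) \, P(o_i \mid q_i) = \prod_{i=1}^{T} a_{q_{i-1} q_i} \, b_{q_i}(o_i)$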
Types of HMM:
- Fully-connected or ergodic HMM: there is a non-zero probability of transitioning between any two states
- Bakis HMM: many of the transitions between states have zero probability, and the state transitions proceed from left to right
Hidden Markov Models are characterized by three fundamental problems:
- Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$
- Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$
- Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$
Computing Likelihood: the Forward Algorithm
An efficient ($O(N^2 T)$) algorithm called the forward algorithm is a kind of dynamic programming. It computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.
Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$:

$\alpha_t(j) = P(o_1, o_2, \ldots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$
- $\alpha_{t-1}(i)$ : the previous forward path probability from the previous time step
- $a_{ij}$ : the transition probability from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$ : the state observation likelihood of the observation symbol $o_t$ given the current state $j$
Formal definition of the forward algorithm
- Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$
- Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$
- Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
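As a sanity check on these equations, here is a minimal NumPy sketch of the forward algorithm (an illustration, not the book's code). The start and end states are folded into vectors `pi` (for $a_{0j}$) and `a_F` (for $a_{iF}$); all names are assumptions of this sketch:

```python
import numpy as np

def forward(pi, A, B, a_F, obs):
    """Forward algorithm: compute P(O | lambda) in O(N^2 T) time.

    pi[j] = a_{0j}, A[i, j] = a_{ij}, B[j, k] = b_j(v_k),
    a_F[i] = a_{iF}, obs = list of observation indices o_1 ... o_T.
    """
    # Initialization: alpha_1(j) = a_{0j} * b_j(o_1)
    alpha = pi * B[:, obs[0]]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_{ij} * b_j(o_t)
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_{iF}
    return alpha @ a_F
```

Each loop iteration fills one column of the trellis, so the whole computation is $O(N^2 T)$ rather than enumerating all $N^T$ possible hidden state paths.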
Decoding: the Viterbi Algorithm
For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.
Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$
The most common decoding algorithm for HMMs is the Viterbi algorithm. Like the forward algorithm, Viterbi is a kind of dynamic programming and makes use of a dynamic programming trellis.
Each cell of the Viterbi trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \ldots, q_{t-1}$, given the automaton $\lambda$:

$v_t(j) = \max\limits_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots q_{t-1}, o_1, o_2, \ldots o_t, q_t = j \mid \lambda) = \max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$
- $v_{t-1}(i)$ : the previous Viterbi path probability from the previous time step
- $a_{ij}$ : the transition probability from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$ : the state observation likelihood of the observation symbol $o_t$ given the current state $j$
Note that the Viterbi algorithm is identical to the forward algorithm except that it takes the max over the previous path probabilities, where the forward algorithm takes the sum. The Viterbi algorithm also keeps backpointers: it computes the best state sequence by tracking the path of hidden states that led to each state and then, at the end, tracing back the best path to the beginning (the Viterbi backtrace).
Formal definition of the Viterbi algorithm
- Initialization:
$v_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$
$bt_1(j) = 0$
- Recursion (recall that states $0$ and $q_F$ are non-emitting):
$v_t(j) = \max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$
$bt_t(j) = \arg\max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$
- Termination:
The best score: $P^* = v_T(q_F) = \max\limits_{i=1}^{N} v_T(i) \cdot a_{iF}$
The start of backtrace: $q_T^* = bt_T(q_F) = \arg\max\limits_{i=1}^{N} v_T(i) \cdot a_{iF}$
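Under the same illustrative conventions as the forward sketch, here is a hedged NumPy sketch of Viterbi with backpointers:

```python
import numpy as np

def viterbi(pi, A, B, a_F, obs):
    """Viterbi decoding: the most probable state path and its score."""
    N, T = A.shape[0], len(obs)
    v = np.empty((T, N))               # v[t, j]: best path probability
    bt = np.zeros((T, N), dtype=int)   # bt[t, j]: backpointer
    # Initialization: v_1(j) = a_{0j} * b_j(o_1)
    v[0] = pi * B[:, obs[0]]
    # Recursion: identical to forward except max replaces sum
    for t in range(1, T):
        scores = v[t - 1][:, None] * A * B[:, obs[t]]  # scores[i, j]
        v[t] = scores.max(axis=0)
        bt[t] = scores.argmax(axis=0)
    # Termination: fold in the transitions into the end state
    last = int(np.argmax(v[-1] * a_F))
    # Viterbi backtrace
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(bt[t, path[-1]])
    return path[::-1], (v[-1] * a_F)[last]
```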
Training HMMs: the Forward-Backward Algorithm
Learning: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.
The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm, a special case of the Expectation-Maximization (EM) algorithm. The algorithm lets us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM.
Let us begin by considering the much simpler case of training a Markov chain rather than an HMM. Since the states in a Markov chain are observed and it has no emission probabilities $B$, we can view a Markov chain as a degenerate HMM where all the $b$ probabilities are 1.0 for the observed symbol and 0 for all other symbols. Thus the only probabilities we need to train are those of the transition probability matrix $A$.
We get the maximum likelihood estimate of the probability $a_{ij}$ of a particular transition between states $i$ and $j$ by counting the number of times the transition was taken, which we could call $C(i \rightarrow j)$, and then normalizing by the total count of all times we took any transition from state $i$:

$a_{ij} = \dfrac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$
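For a fully observed state sequence this maximum likelihood estimate is just normalized counting, as in this short sketch (with made-up sequences):

```python
from collections import Counter

def mle_transitions(sequences):
    """a_{ij} = C(i -> j) / sum_q C(i -> q), from observed state sequences."""
    counts, totals = Counter(), Counter()
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            counts[(i, j)] += 1
            totals[i] += 1
    return {(i, j): c / totals[i] for (i, j), c in counts.items()}

print(mle_transitions([["HOT", "HOT", "COLD"], ["COLD", "HOT"]]))
# {('HOT', 'HOT'): 0.5, ('HOT', 'COLD'): 0.5, ('COLD', 'HOT'): 1.0}
```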
For an HMM we cannot compute these counts directly from an observation sequence, since we don't know which path of states was taken through the machine for a given input.
The Baum-Welch algorithm uses two neat intuitions to solve this problem.
- The first idea is to iteratively estimate the counts. We will start with an estimate for the transition and observation probabilities, and then use these estimated probabilities to derive better and better probabilities.
- The second idea is that we get our estimated probabilities by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.
Backward probability
The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given the automaton $\lambda$):

$\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)$
Formal definition of the backward algorithm
- Initialization: $\beta_T(i) = a_{iF}, \;\; 1 \leq i \leq N$
- Recursion (again, since states $0$ and $q_F$ are non-emitting):
$\beta_t(i) = \sum\limits_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,\; 1 \leq t < T$
- Termination:
$P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(0) = \sum\limits_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
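Here is a matching NumPy sketch of the backward pass, under the same illustrative conventions as the forward sketch (this one returns the full trellis, which the training equations below need):

```python
import numpy as np

def backward(pi, A, B, a_F, obs):
    """Backward pass: beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.empty((T, N))
    # Initialization: beta_T(i) = a_{iF}
    beta[-1] = a_F
    # Recursion: beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O | lambda) = sum_j a_{0j} * b_j(o_1) * beta_1(j)
    return beta, pi @ (B[:, obs[0]] * beta[0])
```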
We are now ready to understand how the forward and backward probabilities can help us compute the transition probability $a_{ij}$ and the observation probability $b_i(o_t)$ from an observation sequence, even though the actual path taken through the machine is hidden.
Transition Probability Matrix
Let's begin by showing how to estimate $\hat{a}_{ij}$:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$
How do we compute the numerator? Here is the intuition. Assume we had some estimate of the probability that a given transition $i \rightarrow j$ was taken at a particular point in time $t$ in the observation sequence. If we knew this probability for each particular time $t$, we could sum over all times $t$ to estimate the total count for the transition $i \rightarrow j$.
Formally, let's define the probability $\xi_t$ as the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the observation sequence and of course the model:
$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$
In detail:

$\left.\begin{matrix} \xi_t(i, j) = P(q_t=i, q_{t+1}=j \mid O, \lambda)\\ \text{not-quite-}\xi_t(i,j) = P(q_t=i, q_{t+1}=j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)\\ P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(0) = \sum\limits_{j=1}^{N} \alpha_t(j)\, \beta_t(j)\\ \text{laws of probability}: \; P(Q \mid O, \lambda) = \frac{P(Q, O \mid \lambda)}{P(O \mid \lambda)} \end{matrix}\right\} \Rightarrow \xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(q_F)}$
The expected number of transitions from state $i$ to state $j$ is then the sum over all $t$ of $\xi$, so here is the final formula for $\hat{a}_{ij}$:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i, j)}$
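Given full forward and backward trellises, $\xi_t(i,j)$ is a few lines of NumPy. A sketch, assuming `alpha` is the full $(T, N)$ forward trellis (the forward sketch above keeps only the current step, so it would need to store every $\alpha_t$) and `beta` comes from the backward sketch:

```python
import numpy as np

def compute_xi(alpha, beta, A, B, a_F, obs):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda)."""
    T, N = alpha.shape
    p_O = alpha[-1] @ a_F                 # P(O | lambda) = alpha_T(q_F)
    xi = np.empty((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j) / P(O | lambda)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_O
    return xi
```

Summing `xi` over $t$ gives the numerator of $\hat{a}_{ij}$; summing over $t$ and $j$ gives the denominator.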
Observation Probability Matrix
Next we estimate the probability of a given symbol $v_k$ from the observation vocabulary $V$, given a state $j$: $\hat{b}_j(v_k)$.
$\hat{b}_j(v_k) = \dfrac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$
For this we will need to know the probability of being in state $j$ at time $t$, which we call $\gamma_t(j)$:

$\gamma_t(j) = P(q_t = j \mid O, \lambda) = \dfrac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} = \dfrac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$
We are now ready to compute $b$. For the numerator, we sum $\gamma_t(j)$ over all time steps $t$ in which the observation $o_t$ is the symbol $v_k$ that we are interested in. For the denominator, we sum $\gamma_t(j)$ over all time steps $t$. The result is the percentage of the time that we were in state $j$ and saw the symbol $v_k$:
$\hat{b}_j(v_k) = \dfrac{\sum_{t=1 \,\text{s.t.}\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
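In code, $\gamma_t(j)$ and the re-estimated emissions follow directly from the same trellises; a sketch with the same assumed array shapes, where `V` is the vocabulary size:

```python
import numpy as np

def reestimate_B(alpha, beta, a_F, obs, V):
    """b_hat[j, k] = (sum of gamma_t(j) where o_t = v_k) / sum_t gamma_t(j)."""
    p_O = alpha[-1] @ a_F
    gamma = alpha * beta / p_O            # gamma[t, j] = P(q_t = j | O, lambda)
    b_hat = np.zeros((alpha.shape[1], V))
    for t, o_t in enumerate(obs):
        b_hat[:, o_t] += gamma[t]         # numerator: only steps where o_t = v_k
    return b_hat / gamma.sum(axis=0)[:, None]
```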
We now have ways to re-estimate the transition probabilities $A$ and observation probabilities $B$ from an observation sequence $O$, assuming that we already have a previous estimate of $A$ and $B$.
The Forward-Backward algorithm
The forward-backward algorithm starts with some initial estimate of the HMM parameters $\lambda = (A, B)$ and then iteratively runs two steps. Like other instances of the EM algorithm, these are the expectation step, or E-step, and the maximization step, or M-step.
In the E-step, we compute the expected state occupancy count $\gamma$ and the expected state transition count $\xi$ from the earlier $A$ and $B$ probabilities. In the M-step, we use $\gamma$ and $\xi$ to recompute new $A$ and $B$ probabilities.
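Putting both steps together for a single training sequence, one Baum-Welch iteration might look like the following self-contained sketch (unscaled, so it will underflow on long sequences; re-estimating the start and end transitions is omitted):

```python
import numpy as np

def em_step(pi, A, B, a_F, obs):
    """One forward-backward (Baum-Welch) iteration: returns new A and B."""
    N, T = A.shape[0], len(obs)
    # E-step: full forward and backward trellises
    alpha, beta = np.empty((T, N)), np.empty((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = a_F
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_O = alpha[-1] @ a_F
    gamma = alpha * beta / p_O                         # expected occupancy
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
                   for t in range(T - 1)]) / p_O       # expected transitions
    # M-step: renormalize the expected counts into new probabilities
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_new = np.zeros_like(B)
    for t, o_t in enumerate(obs):
        B_new[:, o_t] += gamma[t]
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new
```

In practice the iteration is repeated until $P(O \mid \lambda)$ converges, and log probabilities or per-step scaling are used to avoid underflow.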
thanks