[Repost] Dynamic Programming

This article takes an in-depth look at the concept and principles of dynamic programming and demonstrates, through worked examples, how to use it to solve practical problems. It covers the basic idea behind dynamic programming, how to define states and state-transition equations, and walks through concrete examples of the technique in action. Readers will learn how to construct a dynamic-programming solution and how dynamic programming improves algorithmic efficiency.

I came across an article on dynamic programming written by a talented foreign author; the original text is reposted below:

An important part of given problems can be solved with the help of dynamic programming (DP for short). Being able to tackle problems of this type would greatly increase your skill. I will try to help you in understanding how to solve problems using DP. The article is based on examples, because a raw theory is very hard to understand. 

Note: If you're bored reading one section and you already know what's being discussed in it - skip it and go to the next one. 

Introduction (Beginner)

What is dynamic programming, and how can it be described?

DP is an algorithmic technique which is usually based on a recurrent formula and one (or some) starting states. A sub-solution of the problem is constructed from previously found ones. DP solutions typically have polynomial complexity, which assures a much faster running time than other techniques like backtracking, brute force, etc.

Now let's see the base of DP with the help of an example: 

Given a list of N coins, their values (V1, V2, ..., VN), and the total sum S. Find the minimum number of coins the sum of which is S (we can use as many coins of one type as we want), or report that it's not possible to select coins in such a way that they sum up to S.

Now let's start constructing a DP solution: 

First of all we need to find a state for which an optimal solution is found and with the help of which we can find the optimal solution for the next state. 

What does a "state" stand for? 

It's a way to describe a situation, a sub-solution for the problem. For example a state would be the solution for sum i, where i ≤ S. A smaller state than state i would be the solution for any sum j, where j < i. For finding a state i, we need to first find all smaller states j (j < i). Having found the minimum number of coins which sum up to i, we can easily find the next state - the solution for i + 1.

How can we find it? 

It is simple - for each coin j, Vj ≤ i, look at the minimum number of coins found for the sum i - Vj (we have already found it previously). Let this number be m. If m + 1 is less than the minimum number of coins already found for the current sum i, then we write the new result for it.

For a better understanding let's take this example:
Given coins with values 1, 3, and 5.
And the sum S is set to be 11. 

First of all we mark that for state 0 (sum 0) we have found a solution with a minimum number of 0 coins. We then go to sum 1. First, we mark that we haven't yet found a solution for this one (a value of Infinity would be fine). Then we see that only coin 1 is less than or equal to the current sum. Analyzing it, we see that for sum 1 - V1 = 0 we have a solution with 0 coins. Because we add one coin to this solution, we'll have a solution with 1 coin for sum 1. It's the only solution yet found for this sum. We write (save) it. Then we proceed to the next state - sum 2. We again see that the only coin which is less than or equal to this sum is the first coin, having a value of 1. The optimal solution found for sum (2 - 1) = 1 is coin 1. This coin 1 plus the first coin will sum up to 2, and thus make a sum of 2 with the help of only 2 coins. This is the best and only solution for sum 2.

Now we proceed to sum 3. We now have 2 coins which are to be analyzed - the first and the second one, having values of 1 and 3. Let's see the first one. There exists a solution for sum 2 (3 - 1) and therefore we can construct from it a solution for sum 3 by adding the first coin to it. Because the best solution for sum 2 that we found has 2 coins, the new solution for sum 3 will have 3 coins. Now let's take the second coin with value equal to 3. The sum to which this coin needs to be added to make 3 is 0. We know that sum 0 is made up of 0 coins. Thus we can make a sum of 3 with only one coin - 3. We see that it's better than the previously found solution for sum 3, which was composed of 3 coins. We update it and mark it as having only 1 coin.

We do the same for sum 4, and get a solution of 2 coins - 1 + 3. And so on.

Pseudocode: 

Set Min[i] equal to Infinity for all of i
Min[0] = 0

For i = 1 to S
    For j = 0 to N - 1
        If (Vj <= i AND Min[i - Vj] + 1 < Min[i])
            Then Min[i] = Min[i - Vj] + 1

Output Min[S]
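For concreteness, here is the same algorithm as a small, runnable Python function (a direct rendering of the pseudocode; the coin values and target sum are the ones from the example above):

def min_coins(values, S):
    # Min[i] = minimum number of coins summing to i, or infinity if unreachable
    INF = float("inf")
    min_count = [INF] * (S + 1)
    min_count[0] = 0                      # sum 0 needs 0 coins
    for i in range(1, S + 1):
        for v in values:
            if v <= i and min_count[i - v] + 1 < min_count[i]:
                min_count[i] = min_count[i - v] + 1
    return None if min_count[S] == INF else min_count[S]

print(min_coins([1, 3, 5], 11))           # -> 3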

Here are the solutions found for all sums: 

Sum | Min. nr. of coins | Coin value added to a smaller sum to obtain this sum (shown in brackets)
  0 | 0                 | -
  1 | 1                 | 1 (0)
  2 | 2                 | 1 (1)
  3 | 1                 | 3 (0)
  4 | 2                 | 1 (3)
  5 | 1                 | 5 (0)
  6 | 2                 | 3 (3)
  7 | 3                 | 1 (6)
  8 | 2                 | 3 (5)
  9 | 3                 | 1 (8)
 10 | 2                 | 5 (5)
 11 | 3                 | 1 (10)



As a result we have found a solution of 3 coins which sum up to 11. 

Additionally, by tracking data about how we got to a certain sum from a previous one, we can find what coins were used in building it. For example: to sum 11 we got by adding the coin with value 1 to a sum of 10. To sum 10 we got from 5. To 5 - from 0. This way we find the coins used: 1, 5 and 5. 
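One simple way to keep that tracking data is to store, next to Min[i], the value of the coin that was added last to reach sum i, and then walk backwards from S; a minimal sketch extending the Python function above:

def min_coins_traceback(values, S):
    INF = float("inf")
    min_count = [INF] * (S + 1)
    last_coin = [0] * (S + 1)          # coin added to a smaller sum to reach sum i
    min_count[0] = 0
    for i in range(1, S + 1):
        for v in values:
            if v <= i and min_count[i - v] + 1 < min_count[i]:
                min_count[i] = min_count[i - v] + 1
                last_coin[i] = v
    if min_count[S] == INF:
        return None
    coins, i = [], S
    while i > 0:                        # 11 -> 10 -> 5 -> 0 in the example
        coins.append(last_coin[i])
        i -= last_coin[i]
    return coins

print(min_coins_traceback([1, 3, 5], 11))   # -> [1, 5, 5]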

Having understood the basic way a DP is used, we may now see a slightly different approach to it. It involves the change (update) of best solution yet found for a sum i, whenever a better solution for this sum was found. In this case the states aren't calculated consecutively. Let's consider the problem above. Start with having a solution of 0 coins for sum 0. Now let's try to add first coin (with value 1) to all sums already found. If the resulting sum t will be composed of fewer coins than the one previously found - we'll update the solution for it. Then we do the same thing for the second coin, third coin, and so on for the rest of them. For example, we first add coin 1 to sum 0 and get sum 1. Because we haven't yet found a possible way to make a sum of 1 - this is the best solution yet found, and we mark S[1]=1. By adding the same coin to sum 1, we'll get sum 2, thus making S[2]=2. And so on for the first coin. After the first coin is processed, take coin 2 (having a value of 3) and consecutively try to add it to each of the sums already found. Adding it to 0, a sum 3 made up of 1 coin will result. Till now, S[3] has been equal to 3, thus the new solution is better than the previously found one. We update it and mark S[3]=1. After adding the same coin to sum 1, we'll get a sum 4 composed of 2 coins. Previously we found a sum of 4 composed of 4 coins; having now found a better solution we update S[4] to 2. The same thing is done for next sums - each time a better solution is found, the results are updated. 
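A minimal Python sketch of this coin-by-coin update order (it produces the same answers as before, just filled in a different order):

def min_coins_by_coin(values, S):
    INF = float("inf")
    best = [INF] * (S + 1)
    best[0] = 0
    for v in values:                    # process one coin type at a time
        for t in range(v, S + 1):       # try to improve every reachable sum
            if best[t - v] + 1 < best[t]:
                best[t] = best[t - v] + 1
    return None if best[S] == INF else best[S]

print(min_coins_by_coin([1, 3, 5], 11))     # -> 3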

Elementary

To this point, very simple examples have been discussed. Now let's see how to find a way of passing from one state to another for harder problems. For that we will introduce a new term called a recurrence relation, which makes a connection between a lower and a greater state.

Let's see how it works: 

Given a sequence of N numbers - A[1], A[2], ..., A[N]. Find the length of the longest non-decreasing sequence.

As described above we must first find how to define a "state" which represents a sub-problem and thus we have to find a solution for it. Note that in most cases the states rely on lower states and are independent from greater states. 

Let's define a state i as being the longest non-decreasing sequence which has its last number A[i]. This state carries only data about the length of this sequence. Note that for i < j the state i is independent from j, i.e. it doesn't change when we calculate state j. Let's see now how these states are connected to each other. Having found the solutions for all states lower than i, we may now look for state i. At first we initialize it with a solution of 1, which consists only of the i-th number itself. Now for each j < i let's see if it's possible to pass from it to state i. This is possible only when A[j] ≤ A[i], thus keeping (assuring) the sequence non-decreasing. So if S[j] (the solution found for state j) + 1 (number A[i] added to this sequence which ends with number A[j]) is better than the solution found for i (i.e. S[j] + 1 > S[i]), we make S[i] = S[j] + 1. This way we consecutively find the best solutions for each i, until the last state N.
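A short Python sketch of this O(N^2) recurrence (the 1-based A[i] of the text becomes 0-based indexing here):

def longest_non_decreasing(A):
    n = len(A)
    S = [1] * n                         # S[i]: best length of a sequence ending at A[i]
    for i in range(n):
        for j in range(i):
            if A[j] <= A[i] and S[j] + 1 > S[i]:
                S[i] = S[j] + 1
    return max(S)

print(longest_non_decreasing([5, 3, 4, 8, 6, 7]))   # -> 4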

Let's see what happens for a randomly generated sequence: 5, 3, 4, 8, 6, 7: 

i | Length of the longest non-decreasing sequence ending at A[i] | State j from which we "arrived" at this one
1 | 1 | 1 (the first number itself)
2 | 1 | 2 (the second number itself)
3 | 2 | 2
4 | 3 | 3
5 | 3 | 3
6 | 4 | 5

(End of reposted article.)

Original article: http://www.topcoder.com/tc?d1=tutorials&d2=dynProg&module=Static

Reposted from: https://www.cnblogs.com/leizisdu/archive/2011/12/14/2287984.html

[SLP · Python] A spoken-digit classification and recognition system based on DTW, GMM and HMM
Sherry-XLL, published 2021-08-28

This article looks at three approaches to isolated word speech recognition: dynamic time warping (DTW), Gaussian mixture models (GMM) and hidden Markov models (HMM). Implemented in Python for a ten-digit recognition task, the pipeline preprocesses the dataset, extracts features, trains the models and makes predictions. In the experiments, the custom GMM model reaches 100% accuracy, while the DTW and HMM models reach 78% and 87.5% respectively. Links to the code and dataset are provided so that readers can reproduce the results.

Preface

GitHub project: https://github.com/Sherry-XLL/Digital-Recognition-DTW_HMM_GMM

In isolated word speech recognition, DTW, GMM and HMM are three typical methods:

Dynamic time warping (DTW)
Gaussian mixture model (GMM)
Hidden Markov model (HMM)

This article does not explain the basic theory behind these three methods; it focuses on their Python implementations for a concrete speech recognition task, a 10-digit recognition system, building isolated word classification models for the digits 0-9 with DTW, GMM and HMM respectively.

Contents

Preface
I. Audio data processing: 1. Dataset; 2. Audio preprocessing; 3. MFCC feature extraction
II. Isolated Speech Recognition - DTW: 1. DTW algorithm steps; 2. Writing the DTW function; 3. Model training and prediction
III. Isolated Speech Recognition - GMM: 1. GMM training pipeline; 2. EM algorithm steps; 3. GMM implementation
IV. Isolated Speech Recognition - HMM: 1. The hidden Markov model; 2. Implementation
V. Experimental results: 1. Environment; 2. File overview; 3. Results comparison
VI. Reference links

I. Audio data processing

1. Dataset

The task in this article is to build isolated word recognition models for the digits 0 to 9; the dataset is roughly a spoken counterpart of the MNIST handwritten digit database. The raw data are recordings of the digits 0-9 by 20 speakers, 200 .wav files in total, also uploaded to the GitHub project. The layout is:

records
  digit_0
  digit_1
  digit_2
  ...
  digit_9
    1_0.wav
    ......
    20_0.wav

The records folder contains ten subfolders, digit_0 through digit_9. Each subfolder corresponds to one digit and holds the 20 speakers' recordings of that digit; 1_0.wav, for example, is speaker 1's recording of digit 0.

To compare the different methods, the dataset is split into a training set and a test set at a ratio of 4:1. Since what we really care about is how the models behave on unseen speakers, the first 16 speakers form the training set (160 audio files) and speakers 17-20 form the test set (40 .wav recordings).

2. Audio preprocessing

preprocess.py references https://github.com/rocketeerli/Computer-VisionandAudio-Lab/tree/master/lab1

When the digits were recorded, short stretches of silence were captured before and after each utterance, and their length differs from clip to clip. Modelling the raw audio directly and ignoring this difference would inevitably affect the results, so each clip is preprocessed with double-threshold voice endpoint detection.

Each .wav file is read with wave, and readframes plus np.frombuffer yield the sample array. We then write calEnergy to compute the short-time energy and calZeroCrossingRate to compute the zero-crossing rate (ZCR), with the frame size framesize set to the common value of 256:

import numpy as np

def calEnergy(wave_data):
    """
    :param wave_data: binary data of audio file
    :return: energy
    """
    energy = []
    sum = 0
    for i in range(len(wave_data)):
        sum = sum + (int(wave_data[i]) * int(wave_data[i]))
        if (i + 1) % 256 == 0:
            energy.append(sum)
            sum = 0
        elif i == len(wave_data) - 1:
            energy.append(sum)
    return energy


def calZeroCrossingRate(wave_data):
    """
    :param wave_data: binary data of audio file
    :return: ZeroCrossingRate
    """
    zeroCrossingRate = []
    sum = 0
    for i in range(len(wave_data)):
        sum = sum + np.abs(int(wave_data[i] >= 0) - int(wave_data[i - 1] >= 0))
        if (i + 1) % 256 == 0:
            zeroCrossingRate.append(float(sum) / 255)
            sum = 0
        elif i == len(wave_data) - 1:
            zeroCrossingRate.append(float(sum) / 255)
    return zeroCrossingRate

With the energy and zero-crossing-rate arrays of a clip, endpoint detection produces the trimmed audio. The detection function endPointDetect works in three steps:

First, a high energy threshold (MH) is computed from energy and zeroCrossingRate; it identifies the voiced part of the speech and gives a first segmentation A.
Second, a lower energy threshold (ML) is computed and used to search outwards from both ends, refining A into B.
Third, a zero-crossing-rate threshold (Zs) is computed and B is processed further with the ZCR taken into account, returning the final result C.

Finally, the processed audio of speakers 1-16 is stored in processed_train_records and that of speakers 17-20 in processed_test_records.
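The article does not list the code of endPointDetect itself (it lives in preprocess.py of the project). Purely as an illustration of the double-threshold idea described above, and not the project's actual code, a simplified frame-level version with heuristic placeholder thresholds might look like this:

def endpoint_detect_sketch(energy, zcr):
    # Illustrative double-threshold endpoint detection over per-frame energy/ZCR.
    mh = max(energy) / 4                           # high energy threshold (heuristic)
    ml = (sum(energy[:5]) / 5 + mh) / 4            # lower energy threshold (heuristic)
    zs = max(sum(zcr[:5]) / 5, 1e-3)               # zero-crossing-rate threshold (heuristic)
    # step 1: frames above MH are taken as the voiced core of the utterance
    voiced = [i for i, e in enumerate(energy) if e > mh]
    if not voiced:
        return 0, len(energy) - 1
    start, end = voiced[0], voiced[-1]
    # step 2: grow the segment outwards while the energy stays above ML
    while start > 0 and energy[start - 1] > ml:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > ml:
        end += 1
    # step 3: grow further while the zero-crossing rate stays above Zs
    while start > 0 and zcr[start - 1] > 3 * zs:
        start -= 1
    while end < len(energy) - 1 and zcr[end + 1] > 3 * zs:
        end += 1
    return start, end                              # frame indices of the detected speech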
3. MFCC feature extraction

Audio files cannot be fed to a recognizer directly; features have to be extracted first. In speech recognition, the most widely used features are the Mel-scale Frequency Cepstral Coefficients (MFCC). In Python, MFCC extraction is available in packaged toolkits such as librosa or python_speech_features. This article uses librosa to extract a 39-dimensional MFCC feature vector per frame (13 coefficients plus their first- and second-order deltas):

import numpy as np
import librosa

def mfcc(wav_path, delta = 2):
    """
    Read .wav files and calculate MFCC
    :param wav_path: path of audio file
    :param delta: derivative order, default order is 2
    :return: mfcc
    """
    y, sr = librosa.load(wav_path)
    # MEL frequency cepstrum coefficient
    mfcc_feat = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    ans = [mfcc_feat]
    # Calculate the 1st derivative
    if delta >= 1:
        mfcc_delta1 = librosa.feature.delta(mfcc_feat, order=1, mode='nearest')
        ans.append(mfcc_delta1)
    # Calculate the 2nd derivative
    if delta >= 2:
        mfcc_delta2 = librosa.feature.delta(mfcc_feat, order=2, mode='nearest')
        ans.append(mfcc_delta2)

    return np.transpose(np.concatenate(ans, axis=0), [1, 0])

See the official librosa and python-speech-features documentation for more audio feature extraction tools.

II. Isolated Speech Recognition - DTW

1. DTW algorithm steps

Dynamic time warping (DTW) is essentially a simple dynamic programming algorithm with two stages: computing the distances between the frames of the two patterns, i.e. the frame-matching distance matrix, and then finding the optimal path through that matrix. The steps are:

First, initialize the cost matrix with the Euclidean distance between every pair of frames.

Second, fill the cost matrix by dynamic programming, comparing each data vector against the template vectors; the transition used in the code below is

cost[i, j] += min(cost_0[i, j], cost_0[i + 1, j], cost_0[i, j + 1])

i.e. each cell accumulates the minimum of its three predecessor cells in the padded matrix cost_0.

Third, compute the warp path; the minimum distance is the normalized distance. The first training sequence is set as the standard (template), all training sequences are merged onto it and averaged, and the models for the digits 0-9 are returned.

At test time, DTW measures the "distance" from each audio clip to the ten digit templates, and the digit whose template distance is smallest is taken as the prediction.

2. Writing the DTW function

The key to the dtw function is computing the dynamic programming matrix and the warp path of two MFCC sequences. Let the two input MFCC sequences be M1 and M2, with lengths M1_len and M2_len; the cost matrix has size M1_len * M2_len. It is first initialized with Euclidean distances and then filled according to the transition above:

import numpy as np
from math import sqrt

def dtw(M1, M2):
    """
    Compute Dynamic Time Warping(DTW) of two mfcc sequences.
    :param M1, M2: two mfcc sequences
    :return: the minimum distance and the wrap path
    """
    # length of two sequences
    M1_len = len(M1)
    M2_len = len(M2)
    cost_0 = np.zeros((M1_len + 1, M2_len + 1))
    cost_0[0, 1:] = np.inf
    cost_0[1:, 0] = np.inf

    # Initialize the array size to M1_len * M2_len
    cost = cost_0[1:, 1:]
    for i in range(M1_len):
        for j in range(M2_len):
            cost[i, j] = calEuclidDist(M1[i], M2[j])

    # dynamic programming to calculate cost matrix
    for i in range(M1_len):
        for j in range(M2_len):
            cost[i, j] += min([cost_0[i, j],
                               cost_0[min(i + 1, M1_len - 1), j],
                               cost_0[i, min(j + 1, M2_len - 1)]])

The Euclidean distance helper calEuclidDist is a one-liner:

def calEuclidDist(A, B):
    """
    :param A, B: two vectors
    :return: the Euclidean distance of A and B
    """
    return sqrt(sum([(a - b) ** 2 for (a, b) in zip(A, B)]))

After the cost matrix is filled, the warp path still has to be computed. The cases where one MFCC sequence has length 1 are handled separately; the minimum-cost path_1 and path_2 are what we want, and the array path is returned:

    # calculate the warp path
    if len(M1) == 1:
        path = np.zeros(len(M2)), range(len(M2))
    elif len(M2) == 1:
        path = range(len(M1)), np.zeros(len(M1))
    else:
        i, j = np.array(cost_0.shape) - 2
        path_1, path_2 = [i], [j]
        # path_1, path_2 with the minimum cost is what we want
        while (i > 0) or (j > 0):
            arg_min = np.argmin((cost_0[i, j], cost_0[i, j + 1], cost_0[i + 1, j]))
            if arg_min == 0:
                i -= 1
                j -= 1
            elif arg_min == 1:
                i -= 1
            else:
                j -= 1
            path_1.insert(0, i)
            path_2.insert(0, j)
        # convert to array
        path = np.array(path_1), np.array(path_2)

As the docstring states, dtw finally returns the minimum distance together with the warp path; see dtw.py in the project for the complete function.
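Assuming dtw returns the pair (minimum distance, path) described in its docstring, a quick sanity check on two toy "MFCC" sequences might look like this (illustrative only):

import numpy as np

seq_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])                 # 3 frames, 2-dim features
seq_b = np.array([[0.0, 0.0], [0.9, 1.1], [2.1, 1.9], [2.0, 2.0]])     # 4 frames

dist, path = dtw(seq_a, seq_b)
print(dist)                           # accumulated warping distance between the sequences
print(list(zip(path[0], path[1])))    # frame alignment between seq_a and seq_b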
3. Model training and prediction

In the training stage, for each digit's 16 training recordings we first compute their MFCC sequences into the list mfcc_list and take the first recording's MFCC sequence as the standard template. The dynamic time warping path between the standard and every template is computed, the aligned frames are averaged for normalization, and the result is appended to the model list. The returned model contains the DTW models of the ten digits 0-9:

# set the first sequence as standard, merge all training sequences
mfcc_count = np.zeros(len(mfcc_list[0]))
mfcc_all = np.zeros(mfcc_list[0].shape)
for i in range(len(mfcc_list)):
    # calculate the wrap path between standard and each template
    _, path = dtw(mfcc_list[0], mfcc_list[i])
    for j in range(len(path[0])):
        mfcc_count[int(path[0][j])] += 1
        mfcc_all[int(path[0][j])] += mfcc_list[i][path[1][j]]

# Generalization by averaging the templates
model_digit = np.zeros(mfcc_all.shape)
for i in range(len(mfcc_count)):
    for j in range(len(mfcc_all[i])):
        model_digit[i][j] = mfcc_all[i][j] / mfcc_count[i]

# return model with models of 0-9 digits
model.append(model_digit)

The test stage is simpler: pass the trained model and the path of the audio to be predicted into predict_dtw, compute the minimum DTW distance min_dist between each digit's model and the audio's MFCC sequence, and the digit with the smallest min_dist is the predicted label.

def predict_dtw(model, file_path):
    """
    :param model: trained model
    :param file_path: path of .wav file
    :return: digit
    """
    # extract the MFCC features of the input audio
    mfcc_feat = mfcc(file_path)
    # Iterate, find the digit with the minimum distance
    digit = 0
    min_dist, _ = dtw(model[0], mfcc_feat)
    for i in range(1, len(model)):
        dist, _ = dtw(model[i], mfcc_feat)
        if dist < min_dist:
            digit = i
            min_dist = dist
    return str(digit)

III. Isolated Speech Recognition - GMM

1. GMM training pipeline

Reference: the Chinese documentation of sklearn's Gaussian mixture models, https://www.scikitlearn.com.cn/0.21.3/20/#211

A Gaussian mixture model (GMM) is a probabilistic model which assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. It captures the covariance structure of the data as well as the centres of the latent Gaussians, and can likewise be used for the digit recognition task in this article.

scikit-learn is a machine learning toolkit for Python; sklearn.mixture is its package for unsupervised learning with Gaussian mixture models, and the GaussianMixture object implements the expectation-maximization (EM) algorithm used to fit them.

To call the packaged library directly, it is enough to add "from sklearn.mixture import GaussianMixture" to the code: GaussianMixture.fit fits a Gaussian mixture model to the training data and predict makes predictions; see the official documentation for details.

The GMM training and prediction procedure is similar to that of DTW and is not repeated here (the original post includes a flowchart at this point).
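Before the from-scratch implementation below, here is a minimal sketch of the library route just mentioned: one GaussianMixture per digit, with the most likely digit chosen by average log-likelihood. The helper names and hyperparameters are illustrative, not the project's exact gmm_from_sklearn.py:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_sklearn_gmms(train_feats, n_components=5):
    # train_feats: dict mapping digit -> list of MFCC arrays (n_frames x 39)
    models = {}
    for digit, feats in train_feats.items():
        X = np.vstack(feats)                       # stack all frames of this digit
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=100, random_state=0)
        models[digit] = gmm.fit(X)
    return models

def predict_sklearn_gmm(models, mfcc_feat):
    # score() returns the average log-likelihood of the frames under each digit's GMM
    return max(models, key=lambda d: models[d].score(mfcc_feat))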
2. EM algorithm steps

Reference: https://blog.youkuaiyun.com/nsh119/article/details/79584629?spm=1001.2014.3001.5501

The core of a GMM is the expectation-maximization (EM) algorithm; training the model parameters alternates between an Expectation (E) step and a Maximization (M) step:

(1) Choose initial values for the parameters and start iterating.

(2) E step: given the current model parameters, compute the responsibility of mixture component k for observation y_j:

\hat{\gamma}_{jk} = \frac{\alpha_k \phi(y_j \mid \theta_k)}{\sum_{k=1}^{K} \alpha_k \phi(y_j \mid \theta_k)}, \quad j = 1, 2, \ldots, N;\ k = 1, 2, \ldots, K

(3) M step: compute the model parameters for the next iteration:

\hat{\mu}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad
\hat{\sigma}_k^2 = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, (y_j - \mu_k)^2}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad
\hat{\alpha}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}, \quad k = 1, 2, \ldots, K

3. GMM implementation

sklearn.mixture/_gaussian_mixture.py: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py

The GMM implementation can draw on the GaussianMixture source code; this article simplifies sklearn's implementation, and the key is the methods of the GMM class.

Our GMM class has three parameters and three attributes: n_components is the number of states (mixture components) in the model; mfcc_data is the MFCC data extracted from the audio; random_state controls the random number generation; means holds the mean of each mixture component; var holds the variances; and weights is the weight matrix of the components.

class GMM:
    """Gaussian Mixture Model.

    Parameters
    ----------
    n_components : int
        Number of states in the model.
    mfcc_data : array, shape (, 39)
        Mfcc data of training wavs.
    random_state: int
        RandomState instance or None, optional (default=0)

    Attributes
    ----------
    means : array, shape (n_components, 1)
        Mean parameters for each mixture component in each state.
    var : array, shape (n_components, 1)
        Variance parameters for each mixture component in each state.
    weights : array, shape (1, n_components)
        Weights matrix of each component.
    """

The log Gaussian probability can be computed from the means and variances:

def log_gaussian_prob(x, means, var):
    """
    Estimate the log Gaussian probability
    :param x: input array
    :param means: The mean of each mixture component.
    :param var: The variance of each mixture component.
    :return: the log Gaussian probability
    """
    return (-0.5 * np.log(var) - np.divide(np.square(x - means), 2 * var) - 0.5 * np.log(2 * np.pi)).sum()

The GMM class consists of five parts: __init__ for initialization; e_step, the Expectation step; m_step, the Maximization step; train, which calls the E step and the M step to train the parameters; and log_prob, which computes the log Gaussian probability of the data under the model.

def __init__(self, mfcc_data, n_components, random_state=0):
    # Initialization
    self.mfcc_data = mfcc_data
    self.means = np.tile(np.mean(self.mfcc_data, axis=0), (n_components, 1))

    # randomization
    np.random.seed(random_state)
    for k in range(n_components):
        randn_k = np.random.randn()
        self.means[k] += 0.01 * randn_k * np.sqrt(np.var(self.mfcc_data, axis=0))

    self.var = np.tile(np.var(self.mfcc_data, axis=0), (n_components, 1))
    self.weights = np.ones(n_components) / n_components
    self.n_components = n_components

def e_step(self, x):
    """
    Expectation-step.
    :param x: input array-like data
    :return: Logarithm of the posterior probabilities (or responsibilities) of
             the point of each sample in x.
    """
    log_resp = np.zeros((x.shape[0], self.n_components))
    for i in range(x.shape[0]):
        log_resp_i = np.log(self.weights)
        for j in range(self.n_components):
            log_resp_i[j] += log_gaussian_prob(x[i], self.means[j], self.var[j])
        y = np.exp(log_resp_i - log_resp_i.max())
        log_resp[i] = y / y.sum()
    return log_resp

def m_step(self, x, log_resp):
    """
    Maximization step.
    :param x: input array-like data
    :param log_resp: Logarithm of the posterior probabilities (or responsibilities)
                     of the point of each sample in data
    """
    self.weights = np.sum(log_resp, axis=0) / np.sum(log_resp)
    denominator = np.sum(log_resp, axis=0, keepdims=True).T

    means_num = np.zeros_like(self.means)
    for k in range(self.n_components):
        means_num[k] = np.sum(np.multiply(x, np.expand_dims(log_resp[:, k], axis=1)), axis=0)
    self.means = np.divide(means_num, denominator)

    var_num = np.zeros_like(self.var)
    for k in range(self.n_components):
        var_num[k] = np.sum(np.multiply(np.square(np.subtract(x, self.means[k])),
                                        np.expand_dims(log_resp[:, k], axis=1)), axis=0)
    self.var = np.divide(var_num, denominator)

def train(self, x):
    """
    :param x: input array-like data
    """
    log_resp = self.e_step(x)
    self.m_step(x, log_resp)

def log_prob(self, x):
    """
    :param x: input array-like data
    :return: calculated log gaussian probability of single modal Gaussian
    """
    sum_prob = 0
    for i in range(x.shape[0]):
        prob_i = np.array([np.log(self.weights[j]) + log_gaussian_prob(x[i], self.means[j], self.var[j])
                           for j in range(self.n_components)])
        sum_prob += np.max(prob_i)
    return sum_prob
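The excerpt above shows only the class itself. As an illustrative sketch of how it might be driven (not the project's exact gmm.py), one GMM per digit can be trained by calling train() for a fixed number of EM iterations, with prediction done via log_prob:

import numpy as np

def train_custom_gmms(train_feats, n_components=5, n_iter=20):
    # train_feats: dict mapping digit -> list of MFCC arrays (n_frames x 39)
    models = {}
    for digit, feats in train_feats.items():
        X = np.vstack(feats)              # all training frames for this digit
        gmm = GMM(X, n_components)
        for _ in range(n_iter):           # each train() call runs one E step and one M step
            gmm.train(X)
        models[digit] = gmm
    return models

def predict_custom_gmm(models, mfcc_feat):
    # pick the digit whose GMM assigns the highest (approximate) log-probability
    return max(models, key=lambda d: models[d].log_prob(mfcc_feat))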
IV. Isolated Speech Recognition - HMM

1. The hidden Markov model

Reference: the hmmlearn documentation, https://hmmlearn.readthedocs.io/en/stable

The hidden Markov model (HMM) is a generative probabilistic model in which a sequence of observable variables X is generated by an internal sequence of hidden states Z. Transitions between the hidden states are assumed to follow a (first-order) Markov chain, specified by a start probability vector π and a transition probability matrix A. The emission probability of the observable variables can be any distribution conditioned on the current hidden state, with parameters θ. An HMM is completely determined by π, A and θ.

An HMM has three fundamental problems:

Given the model parameters and the observed data, estimate the optimal sequence of hidden states.
Given the model parameters and the observed data, calculate the model likelihood.
Given only the observed data, estimate the model parameters.

The first and second problems can be solved with the Viterbi algorithm and the Forward-Backward algorithm, both dynamic programming; the last one is solved with the iterative expectation-maximization (EM) algorithm, also known as the Baum-Welch algorithm.

To call the packaged library directly, it is enough to add "from hmmlearn import hmm" to the code; see the official documentation and source code for details.
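For the library route, a minimal sketch with hmmlearn's GaussianHMM might look like this: one model per digit, trained on the concatenated MFCC frames with per-utterance lengths, and scored by log-likelihood at prediction time. The names and hyperparameters are illustrative rather than the exact hmm_from_hmmlearn.py:

import numpy as np
from hmmlearn import hmm

def train_hmmlearn_models(train_feats, n_states=5):
    # train_feats: dict mapping digit -> list of MFCC arrays (n_frames x 39)
    models = {}
    for digit, feats in train_feats.items():
        X = np.vstack(feats)                  # concatenated frames of all utterances
        lengths = [len(f) for f in feats]     # frame count of each utterance
        model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=20)
        models[digit] = model.fit(X, lengths)
    return models

def predict_hmmlearn(models, mfcc_feat):
    # score() returns the log-likelihood of the observation sequence under each model
    return max(models, key=lambda d: models[d].score(mfcc_feat))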
2. Implementation

For the digit recognition task in this article, the HMM training and prediction procedure is similar to that of DTW and GMM and is not described again. The implementation can draw on the hmmlearn source code; the key is the methods of the HMM class. Unlike the EM algorithm used for the GMM, the HMM implementation here uses the Viterbi algorithm and the Forward-Backward algorithm.

The HMM class has two parameters and four attributes: n_components is the number of states in the model; mfcc_data is the MFCC data extracted from the audio; init_prob is the initial probability over states; means holds the means of the mixture components in each state; var holds the variances; and trans_mat is the transition matrix between states.

class HMM():
    """Hidden Markov Model.

    Parameters
    ----------
    n_components : int
        Number of states in the model.
    mfcc_data : array, shape (, 39)
        Mfcc data of training wavs.

    Attributes
    ----------
    init_prob : array, shape (n_components, )
        Initial probability over states.
    means : array, shape (n_components, 1)
        Mean parameters for each mixture component in each state.
    var : array, shape (n_components, 1)
        Variance parameters for each mixture component in each state.
    trans_mat : array, shape (n_components, n_components)
        Matrix of transition probabilities between states.
    """

The HMM class consists of the following parts:

__init__ initializes the parameters;
viterbi implements the Viterbi algorithm, which solves the HMM decoding problem, i.e. finds the most likely state labelling of the observation sequence, and is itself a dynamic programming algorithm;
forward computes the forward log probabilities of all states at all time steps;
forward_backward is the Forward-Backward algorithm used for parameter estimation;
train calls forward_backward and viterbi to train the parameters;
log_prob computes the log likelihood of a sequence under each model.

def viterbi(self, data):
    # Viterbi algorithm is used to solve decoding problem of HMM
    # That is to find the most possible labeled sequence of the observation sequence
    for index, single_wav in enumerate(data):
        n = single_wav.shape[0]
        labeled_seq = np.zeros(n, dtype=int)
        log_delta = np.zeros((n, self.n_components))
        psi = np.zeros((n, self.n_components))
        log_delta[0] = np.log(self.init_prob)
        for i in range(self.n_components):
            log_delta[0, i] += log_gaussian_prob(single_wav[0], self.means[i], self.var[i])
        log_trans_mat = np.log(self.trans_mat)
        for t in range(1, n):
            for j in range(self.n_components):
                temp = np.zeros(self.n_components)
                for i in range(self.n_components):
                    temp[i] = log_delta[t - 1, i] + log_trans_mat[i, j] + \
                              log_gaussian_prob(single_wav[t], self.means[j], self.var[j])
                log_delta[t, j] = np.max(temp)
                psi[t, j] = np.argmax(log_delta[t - 1] + log_trans_mat[:, j])
        labeled_seq[n - 1] = np.argmax(log_delta[n - 1])
        for i in reversed(range(n - 1)):
            labeled_seq[i] = psi[i + 1, labeled_seq[i + 1]]
        self.states[index] = labeled_seq

def forward(self, data):
    # Computes forward log-probabilities of all states at all time steps.
    n = data.shape[0]
    m = self.means.shape[0]
    alpha = np.zeros((n, m))
    alpha[0] = np.log(self.init_prob) + np.array([log_gaussian_prob(data[0], self.means[j], self.var[j])
                                                  for j in range(m)])
    for i in range(1, n):
        for j in range(m):
            alpha[i, j] = log_gaussian_prob(data[i], self.means[j], self.var[j]) + np.max(
                np.log(self.trans_mat[:, j].T) + alpha[i - 1])
    return alpha

def forward_backward(self, data):
    # forward backward algorithm to estimate parameters
    gamma_0 = np.zeros(self.n_components)
    gamma_1 = np.zeros((self.n_components, data[0].shape[1]))
    gamma_2 = np.zeros((self.n_components, data[0].shape[1]))

    for index, single_wav in enumerate(data):
        n = single_wav.shape[0]
        labeled_seq = self.states[index]
        gamma = np.zeros((n, self.n_components))
        for t, j in enumerate(labeled_seq[:-1]):
            self.trans_mat[j, labeled_seq[t + 1]] += 1
            gamma[t, j] = 1
        gamma[n - 1, self.n_components - 1] = 1
        gamma_0 += np.sum(gamma, axis=0)
        for t in range(n):
            gamma_1[labeled_seq[t]] += single_wav[t]
            gamma_2[labeled_seq[t]] += np.square(single_wav[t])

    gamma_0 = np.expand_dims(gamma_0, axis=1)
    self.means = gamma_1 / gamma_0
    self.var = (gamma_2 - np.multiply(gamma_0, self.means ** 2)) / gamma_0
    for j in range(self.n_components):
        self.trans_mat[j] /= np.sum(self.trans_mat[j])

V. Experimental results

GitHub project: https://github.com/Sherry-XLL/Digital-Recognition-DTW_HMM_GMM

1. Environment

macOS Catalina Version 10.15.6, Python 3.8, PyCharm 2020.2 (Professional Edition).
2. File overview

dtw.py: implementation of Dynamic Time Warping (DTW)
gmm.py: implementation of the Gaussian Mixture Model (GMM)
hmm.py: implementation of the Hidden Markov Model (HMM)
gmm_from_sklearn.py: train a GMM with GaussianMixture from sklearn
hmm_from_hmmlearn.py: train an HMM with hmm from hmmlearn
preprocess.py: preprocess the audio and split the data
processed_test_records: test audio after preprocessing
processed_train_records: training audio after preprocessing
records: original audio
utils.py: utility functions

To run the code, simply enter the following commands, taking DTW as an example:

python preprocess.py   (creates the processed record folders)
python dtw.py

3. Results comparison

The dataset and preprocessing are exactly the same for every method; running the different files gives the following results:

python dtw.py

---------- Dynamic Time Warping (DTW) ----------
Train num: 160, Test num: 40, Predict true num: 31
Accuracy: 0.78

python gmm_from_sklearn.py

---------- GMM (GaussianMixture) ----------
Train num: 160, Test num: 40, Predict true num: 34
Accuracy: 0.85

python gmm.py

---------- Gaussian Mixture Model (GMM) ----------
confusion_matrix:
[[4 0 0 0 0 0 0 0 0 0]
 [0 4 0 0 0 0 0 0 0 0]
 [0 0 4 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0]
 [0 0 0 0 4 0 0 0 0 0]
 [0 0 0 0 0 4 0 0 0 0]
 [0 0 0 0 0 0 4 0 0 0]
 [0 0 0 0 0 0 0 4 0 0]
 [0 0 0 0 0 0 0 0 4 0]
 [0 0 0 0 0 0 0 0 0 4]]
Train num: 160, Test num: 40, Predict true num: 40
Accuracy: 1.00

python hmm_from_hmmlearn.py

---------- HMM (GaussianHMM) ----------
Train num: 160, Test num: 40, Predict true num: 36
Accuracy: 0.90

python hmm.py

---------- HMM (Hidden Markov Model) ----------
confusion_matrix:
[[4 0 0 0 0 0 0 0 0 0]
 [0 4 0 0 0 0 0 0 0 0]
 [0 0 3 0 1 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0]
 [0 0 0 0 4 0 0 0 0 0]
 [0 0 0 0 1 3 0 0 0 0]
 [0 0 0 0 0 0 3 0 1 0]
 [0 0 0 0 0 0 0 4 0 0]
 [0 0 0 1 0 0 1 0 2 0]
 [0 0 0 0 0 0 0 0 0 4]]
Train num: 160, Test num: 40, Predict true num: 35
Accuracy: 0.875

Method   | DTW  | GMM from sklearn | Our GMM | HMM from hmmlearn | Our HMM
Accuracy | 0.78 | 0.85             | 1.00    | 0.90              | 0.875

Note that the accuracies above are for reference only. Reading the source code shows that n_components differs between files, and the maximum number of iterations max_iter also affects the results. Different hyperparameter settings give different accuracies, so the table is not an objective comparison of the methods. In fact the scikit-learn implementations are more complete and detailed and perform better; the three implementations in this article are only basic versions.

The complete code is in the author's GitHub project; if you run into problems, feel free to point them out in the comments.

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. Please attach the original source link and this notice when reposting.
Original link: https://blog.youkuaiyun.com/Sherry_ling/article/details/118713802