Acoustic Features: i-vector


Extraction Pipeline

1.UBM

universal background model [1]
The UBM is modeled as a GMM and trained with the EM algorithm (a minimal EM sketch follows the list below). There are two approaches:

  • Train a single UBM on all of the data; this requires the training data to be balanced.
  • Train several UBMs and merge them, e.g., one per gender; this makes more effective use of unbalanced data and gives finer control over the final UBM.
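For concreteness, here is a minimal numpy sketch of training a diagonal-covariance GMM with EM. All names (train_diag_gmm, feats, etc.) are illustrative, not Kaldi's API, and the initialization and stopping rules are deliberately naive:

```python
import numpy as np

def train_diag_gmm(feats, num_gauss=4, num_iters=10, seed=0, var_floor=1e-3):
    """EM training of a diagonal-covariance GMM (toy UBM).

    feats: (N, F) array of frames pooled over all training data.
    Returns weights (C,), means (C, F), variances (C, F).
    """
    rng = np.random.default_rng(seed)
    N, F = feats.shape
    # Initialize means on random frames, with a shared global variance.
    means = feats[rng.choice(N, num_gauss, replace=False)]
    variances = np.tile(feats.var(axis=0), (num_gauss, 1))
    weights = np.full(num_gauss, 1.0 / num_gauss)
    for _ in range(num_iters):
        # E-step: per-frame log-likelihood of each component, then responsibilities.
        log_prob = -0.5 * (
            np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
            + np.sum((feats[:, None, :] - means[None]) ** 2 / variances[None], axis=2)
        ) + np.log(weights)[None, :]
        post = np.exp(log_prob - np.logaddexp.reduce(log_prob, axis=1, keepdims=True))
        # M-step: re-estimate parameters from zeroth/first/second-order statistics.
        n_c = post.sum(axis=0)
        means = (post.T @ feats) / n_c[:, None]
        variances = np.maximum((post.T @ feats**2) / n_c[:, None] - means**2, var_floor)
        weights = n_c / N
    return weights, means, variances
```

The gender-dependent variant in the second bullet would train two such GMMs on the male and female subsets, concatenate their components, and renormalize the weights.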

2.supervector

MAP adaptation linearly interpolates the UBM's Gaussians to obtain a speaker-dependent GMM; the concatenated means of that model form the supervector [2]. See [1] for the detailed training procedure.
If the UBM has C components and the feature dimension is F, the resulting supervector has dimension C*F.
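The mean-only relevance-MAP update of [1] is a per-component interpolation between the speaker's data mean and the UBM mean. A minimal sketch, reusing the UBM parameters from the EM sketch above (the relevance factor r=16 is a common choice, not a fixed constant, and all names are illustrative):

```python
import numpy as np

def map_adapt_supervector(feats, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of a diagonal GMM; returns the supervector.

    feats: (N, F) frames of one speaker; UBM parameters as in the EM sketch.
    """
    C, F = means.shape
    # Frame posteriors under the UBM (same E-step as in EM training).
    log_prob = -0.5 * (
        np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
        + np.sum((feats[:, None, :] - means[None]) ** 2 / variances[None], axis=2)
    ) + np.log(weights)[None, :]
    post = np.exp(log_prob - np.logaddexp.reduce(log_prob, axis=1, keepdims=True))
    n_c = post.sum(axis=0)                      # zeroth-order stats, (C,)
    f_c = post.T @ feats                        # first-order stats, (C, F)
    alpha = n_c / (n_c + r)                     # per-component adaptation weight
    # Linear interpolation between the data mean and the UBM mean.
    adapted = alpha[:, None] * (f_c / np.maximum(n_c, 1e-10)[:, None]) \
              + (1.0 - alpha)[:, None] * means
    return adapted.reshape(C * F)               # supervector of dimension C*F
```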

3.ivector

identity vector
s = m + Tw
s: supervector
m: the UBM's mean supervector
T: total-variability matrix
w: i-vector
s and m are already available from the previous two steps, so the only remaining unknown needed to recover w is T.
T can be estimated with the EM algorithm [3].
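Given the UBM and an estimated T, the i-vector is the posterior mean of w under a standard-normal prior, computed from the Baum-Welch statistics of an utterance. A minimal sketch (names are illustrative; n_c and f_c are the zeroth- and first-order statistics computed from UBM posteriors, as in the MAP sketch above):

```python
import numpy as np

def extract_ivector(n_c, f_c, means, variances, T):
    """Posterior mean of w in s = m + Tw, with a N(0, I) prior on w.

    n_c: (C,) zeroth-order stats; f_c: (C, F) first-order stats;
    T: (C*F, D) total-variability matrix.
    """
    C, F = means.shape
    D = T.shape[1]
    inv_var = (1.0 / variances).reshape(C * F)           # diagonal Sigma^-1
    f_centered = (f_c - n_c[:, None] * means).reshape(C * F)
    n_rep = np.repeat(n_c, F)                            # N_c for each supervector dim
    # Posterior precision: L = I + T^T Sigma^-1 N T
    L = np.eye(D) + T.T @ (T * (inv_var * n_rep)[:, None])
    # Posterior mean: w = L^-1 T^T Sigma^-1 (f - N m)
    return np.linalg.solve(L, T.T @ (inv_var * f_centered))
```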

4.LDA PLDA

An i-vector carries both speaker and channel information; LDA and WCCN are used to attenuate the channel component (a sketch of the WCCN step follows).
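As one concrete example of channel compensation, here is a minimal numpy sketch of WCCN on a labeled development set of i-vectors (LDA would be a standard Fisher-discriminant projection applied before it; all names are illustrative):

```python
import numpy as np

def wccn_projection(ivecs, labels):
    """Within-class covariance normalization.

    ivecs: (N, D) development i-vectors; labels: (N,) speaker ids.
    Returns B such that projected vectors x' = B^T x have identity
    within-class covariance, de-emphasizing channel directions.
    """
    D = ivecs.shape[1]
    W = np.zeros((D, D))
    classes = np.unique(labels)
    for c in classes:
        x = ivecs[labels == c]
        xc = x - x.mean(axis=0)
        W += xc.T @ xc / len(x)
    W /= len(classes)
    # B is the Cholesky factor of W^-1, so B^T W B = I.
    return np.linalg.cholesky(np.linalg.inv(W))

# Usage: fit B on development data, then project both enrollment
# and test i-vectors with ivec @ B before scoring.
```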

Kaldi Implementation

1.UBM

The universal background model is represented by a GMM.
The UBM training pipeline, which produces final.dubm:

steps/online/nnet2/train_diag_ubm.sh
#gmm-global-init-from-feats   train an initial GMM on all the features
#gmm-gselect gmm-global-acc-stats   collect the statistics for GMM training
#gmm-global-est   re-estimate the GMM from the statistics
#gmm-global-copy   convert final.dubm to text form

Assume 40-dimensional features and 512 Gaussians.

2.extractor

The i-vector model extracts 100-dimensional i-vector features, which are concatenated with the MFCC features as the DNN input. The resulting model is final.ie. The training pipeline:

steps/online/nnet2/train_ivector_extractor.sh
#ivector-extractor-init   initialize the i-vector extractor from final.dubm
#gmm-global-get-post   compute posteriors of the CMVN'd features under final.dubm
#ivector-extractor-sum-accs   sum the accumulated statistics
#ivector-extractor-est   estimate the final i-vector model final.ie from the statistics

ivector-extractor-init --binary=false --ivector-dim=100 --use-weights=false "gmm-global-to-fgmm final.dubm -|" txt # view the ie model in text form

Since s has dimension 512*40 and m likewise has dimension 512*40, while w has dimension 100, the resulting T has dimension (512*40) x 100.
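For intuition about what ivector-extractor-est does, here is one EM iteration for T in numpy. It is a simplification: Kaldi's extractor also trains per-Gaussian weights and variances, while this sketch updates only T (names are illustrative):

```python
import numpy as np

def em_update_T(stats, means, variances, T):
    """One EM iteration for the total-variability matrix T.

    stats: list of (n_c, f_c) Baum-Welch statistics, one pair per
    utterance, computed against the UBM as in the earlier sketches.
    T: (C*F, D) current estimate; returns the updated matrix.
    """
    C, F = means.shape
    D = T.shape[1]
    inv_var = (1.0 / variances).reshape(C * F)
    A = np.zeros((C, D, D))        # per-component sum of N_c * E[w w^T]
    Cacc = np.zeros((C * F, D))    # sum of (f - N m) E[w]^T
    for n_c, f_c in stats:
        # E-step: posterior of w for this utterance (as in i-vector extraction).
        f_centered = (f_c - n_c[:, None] * means).reshape(C * F)
        n_rep = np.repeat(n_c, F)
        L = np.eye(D) + T.T @ (T * (inv_var * n_rep)[:, None])
        cov = np.linalg.inv(L)                     # posterior covariance
        w = cov @ (T.T @ (inv_var * f_centered))   # posterior mean
        ww = cov + np.outer(w, w)                  # E[w w^T]
        A += n_c[:, None, None] * ww[None]
        Cacc += np.outer(f_centered, w)
    # M-step: per component, solve T_c A_c = C_c for the F rows of T_c.
    T_new = np.empty_like(T)
    for c in range(C):
        rows = slice(c * F, (c + 1) * F)
        T_new[rows] = np.linalg.solve(A[c].T, Cacc[rows].T).T
    return T_new
```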

3. Extracting i-vectors

An i-vector can be extracted once per utterance; in online mode it can be extracted, e.g., once every 10 frames. The required files and options include:

--cmvn-config=run/run_chain_1000h_pitch/exp/ivectors/train_max2/conf/online_cmvn.conf
--ivector-period=10
--splice-config=run/run_chain_1000h_pitch/exp/ivectors/train_max2/conf/splice.conf
--lda-matrix=run/run_chain_1000h_pitch/exp/extractor/final.mat
--global-cmvn-stats=run/run_chain_1000h_pitch/exp/extractor/global_cmvn.stats
--diag-ubm=run/run_chain_1000h_pitch/exp/extractor/final.dubm
--ivector-extractor=run/run_chain_1000h_pitch/exp/extractor/final.ie
--num-gselect=5
--min-post=0.025
--posterior-scale=0.1
--max-remembered-frames=1000
--max-count=0

The i-vector extraction pipeline:

steps/online/nnet2/extract_ivectors_online.sh
#1. feature processing: cmvn + splice + lda
#2. from the features and m (final.dubm), obtain each speaker's s
#3. from s, m (final.dubm), and T (final.ie), obtain w

# view the i-vector features
copy-feats --binary=false --compress=false ark:ivector_online.1.ark ark,t:ivector_online.1.ark.txt
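For intuition about --ivector-period=10 and --posterior-scale, here is a rough numpy sketch of online extraction: accumulate statistics frame by frame and re-estimate w every period frames. It reuses extract_ivector from the sketch in step 3 of the extraction pipeline, and it omits Kaldi's gselect and min-post pruning, so it is only an approximation of the real behavior:

```python
import numpy as np

def extract_ivectors_online(feats, weights, means, variances, T,
                            period=10, posterior_scale=0.1):
    """Emit one i-vector every `period` frames from cumulative statistics.

    Requires extract_ivector() from the earlier sketch; gselect and
    min-post pruning are omitted for clarity.
    """
    C, F = means.shape
    n_c = np.zeros(C)
    f_c = np.zeros((C, F))
    out = []
    for t, x in enumerate(feats, start=1):
        # Frame posteriors under the UBM, scaled like --posterior-scale.
        log_prob = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                           + np.sum((x - means) ** 2 / variances, axis=1)) \
                   + np.log(weights)
        post = np.exp(log_prob - np.logaddexp.reduce(log_prob)) * posterior_scale
        n_c += post
        f_c += post[:, None] * x
        if t % period == 0:
            out.append(extract_ivector(n_c, f_c, means, variances, T))
    return np.array(out)
```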

The configuration used in training and in decoding must be kept consistent; otherwise the results can differ substantially.

References

[1] Speaker Verification Using Adapted Gaussian Mixture Models
[2] Support Vector Machines using GMM Supervectors for Speaker Verification
[3] Implementation of the Standard I-vector System for the Kaldi Speech Recognition Toolkit
