General Framework for Acoustic Modeling
Building an ASR system incrementally:
Context-independent ➔ Context-dependent modeling
Mono-phone ➔ Tri-phone HMM
Single Gaussian per state ➔ Mixture of multiple Gaussians per state
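The last step (single Gaussian to a mixture) is usually also done incrementally, by splitting components and retraining. The sketch below shows one common splitting scheme, assumed here since the slides do not say which one is used: the heaviest component is duplicated, its weight is halved, and the two copies' means are moved apart by 0.2 standard deviations. The function name `split_heaviest` and the `perturb` parameter are illustrative, and diagonal covariances are assumed.

```python
import math

def split_heaviest(weights, means, variances, perturb=0.2):
    """Split the heaviest component of a diagonal-covariance GMM in two.

    weights:   list of mixture weights
    means:     list of mean vectors (lists of floats)
    variances: list of diagonal variance vectors (lists of floats)
    """
    # Index of the component with the largest weight.
    k = max(range(len(weights)), key=lambda i: weights[i])
    # Per-dimension offset: a fraction of the standard deviation.
    offsets = [perturb * math.sqrt(v) for v in variances[k]]
    half = weights[k] / 2.0
    lower = [m - o for m, o in zip(means[k], offsets)]
    upper = [m + o for m, o in zip(means[k], offsets)]
    # Component k is replaced by the "lower" copy; the "upper" copy is appended.
    new_weights = weights[:k] + [half] + weights[k + 1:] + [half]
    new_means = means[:k] + [lower] + means[k + 1:] + [upper]
    new_variances = (variances[:k] + [list(variances[k])]
                     + variances[k + 1:] + [list(variances[k])])
    return new_weights, new_means, new_variances

# Start from a single Gaussian over 2-dimensional features (toy values).
w, mu, var = split_heaviest([1.0], [[0.0, 1.0]], [[1.0, 4.0]])
```

After each split the model would be re-estimated before splitting again, so the number of components per state grows step by step rather than all at once.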
Context-independent Modeling
Flowchart for Crossword Modeling:
Forced Alignment:
Input:
Word-level transcription
Lexicon/Dictionary
Multiple pronunciations
e.g. "Z." (z eh d vs. z iy)
HMMs
Output:
Phoneme-level transcription of the actual pronunciation, with time boundaries
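To make the input side concrete, the sketch below expands a word-level transcription into every phone sequence the lexicon allows. Forced alignment then has to pick one of these candidates. Only the "Z." entry comes from the slides; the word "ZOO", the `LEXICON` dictionary, and the function name are made up for illustration.

```python
from itertools import product

# Toy lexicon: each word maps to one or more pronunciations (lists of phones).
LEXICON = {
    "Z.": [["z", "eh", "d"], ["z", "iy"]],  # two pronunciations, as on the slide
    "ZOO": [["z", "uw"]],                   # hypothetical single-pronunciation word
}

def candidate_phone_sequences(words, lexicon):
    """Return every phone sequence obtainable by choosing one pronunciation per word."""
    per_word = [lexicon[w] for w in words]
    return [
        [phone for pron in combo for phone in pron]
        for combo in product(*per_word)
    ]

candidates = candidate_phone_sequences(["Z.", "ZOO"], LEXICON)
for seq in candidates:
    print(" ".join(seq))
```

The number of candidates is the product of the pronunciation counts per word, which is why a real aligner searches a pronunciation network instead of enumerating sequences.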
To deal with the issue of imprecise transcription:
Initially, the HMMs are trained on the basis of one fixed pronunciation per word.
To determine the actual pronunciations in the utterances used to train the HMM system, HVite is used in forced alignment mode to select the best-matching pronunciations.
The new phone-level transcriptions can then be used to retrain the HMMs.
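The selection step can be illustrated with a toy Viterbi alignment. This is not how HVite is implemented; it is a minimal sketch under strong assumptions: each phone is a single state, transition probabilities are ignored, and the per-frame log-likelihoods in `frames` are invented numbers. The function `force_align` is a hypothetical name.

```python
NEG_INF = float("-inf")

def force_align(phones, frame_loglik):
    """Best left-to-right alignment of `phones` to the frames.

    frame_loglik[t][phone] is the log-likelihood of frame t under that phone.
    Returns (total score, [(phone, start_frame, end_frame_exclusive), ...]).
    """
    T, N = len(frame_loglik), len(phones)
    if N == 0 or T < N:  # every phone needs at least one frame
        return NEG_INF, []
    score = [[NEG_INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = frame_loglik[0][phones[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else NEG_INF
            if stay >= advance:
                best_prev, back[t][j] = stay, j
            else:
                best_prev, back[t][j] = advance, j - 1
            if best_prev > NEG_INF:
                score[t][j] = best_prev + frame_loglik[t][phones[j]]
    # Trace back from the last phone at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # Collapse the frame-level path into segments with time boundaries.
    segments = []
    for t, j in enumerate(path):
        if t > 0 and path[t - 1] == j:
            segments[-1][2] = t + 1
        else:
            segments.append([phones[j], t, t + 1])
    return score[T - 1][N - 1], [tuple(s) for s in segments]

# Invented log-likelihoods for 5 frames of someone saying "Z." as "z iy".
frames = [
    {"z": -1.0, "eh": -5.0, "d": -5.0, "iy": -5.0},
    {"z": -1.0, "eh": -4.0, "d": -5.0, "iy": -3.0},
    {"z": -5.0, "eh": -4.0, "d": -5.0, "iy": -1.0},
    {"z": -5.0, "eh": -5.0, "d": -4.0, "iy": -1.0},
    {"z": -5.0, "eh": -5.0, "d": -3.0, "iy": -1.0},
]
pronunciations = [["z", "eh", "d"], ["z", "iy"]]
best = max(pronunciations, key=lambda p: force_align(p, frames)[0])
best_score, best_segments = force_align(best, frames)
print(best, best_score, best_segments)
```

The pronunciation with the higher alignment score is written to the new phone-level transcription together with its segment boundaries, which here are frame indices.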
Transcription snippets:
Input of Crossword Training:
Stage 1: Generate phone-based transcriptions
Stage 2: Generate monophone HMMs
Stage 3: Generate triphone HMMs and transcriptions
Stage 4: Build fully-trained triphone HMMs
Stage 5: TrainingPriors
Stage 6: Gender-specific HMMs
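The move from monophone to triphone transcriptions in Stage 3 can be sketched as a relabeling of each phone with its left and right neighbors, written in the HTK-style `l-p+r` notation. In the cross-word case the context runs across word boundaries. The function name and the treatment of `sil` (kept context-independent and blocking context) are assumptions for illustration, not taken from the slides.

```python
def to_crossword_triphones(phones, context_free=("sil",)):
    """Rewrite a monophone sequence as `l-p+r` triphone labels.

    Phones listed in `context_free` stay as they are and do not
    contribute context to their neighbors (an assumption of this sketch).
    """
    labels = []
    for i, p in enumerate(phones):
        if p in context_free:
            labels.append(p)
            continue
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        label = p
        if left is not None and left not in context_free:
            label = f"{left}-{label}"
        if right is not None and right not in context_free:
            label = f"{label}+{right}"
        labels.append(label)
    return labels

# Two words back to back ("z iy" followed by "z uw"), padded with silence.
triphones = to_crossword_triphones(["sil", "z", "iy", "z", "uw", "sil"])
print(triphones)
```

In this toy utterance the label `z-iy+z` takes its right context from the first phone of the following word, which is what distinguishes cross-word from word-internal triphones.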
Output of Crossword Training