Kaldi lattices format

本文介绍Kaldi中的WFST解码方法及两种格子(lattice)表示形式:Lattice与Compact Lattice,并解释了如何从格子中获取准确的时间信息。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Kaldi is based on WFST for decoding so is the lattices. Firstly, a WFST has a set of states with one distinguished start state; each state has a final cost; and there is a set of arcs between the states, where each arc has an input label, an output label, and a weight, which is also referred to as a cost and is usually  the negated log-probability

To decode an utterance with T frames, the WFST interpretation is as follows. We construct an accepter, or WFSA (WFST with the same input and output symbol for all the arcs). It has T+1 states, with an arc for each cobmination of (time, context-dependent HMM state). The costs on there arcs correspond to  negated and scaled acoustic log-likelihood . Call this acceptor U, then the complete search space is:
S= U * HCLG, where * represents the composition operation and HCLG is the decoding graph generated for Kaldi. HCLG integrates the HMM transition, context expansion, lexicon and most importantly the language model. In composition, the weights are added together. As they are negated log-likelihoods, it is effectively multiplying the original probabilities.

There are two representations of lattices in Kaldi:

1) the Lattice type

Each lattice is stored as an FST, with the following format:

[start state id] [end state id] [input symbol] [output symbol] [weight]

Usually the input symbols are the transition-ids and output symbols are words. The weight is usually:

[the graph cost],[the acoustic cost]

[the graph cost]  : a sum of the LM cost, the (weighted) transition probabilities, and any pronunciation cost. Since they are mixed together, in order to do things like LM rescoring Kaldi typically first subtract the "old" LM cost and then add in the "new" LM cost to avoid touching the other parts of the cost.
[the acoustic cost]  : the acoustic cost is the  unscaled raw scores . The scaling is applied before any algorithm and unscale them again before writing them out.

Example:
11031_A_20120617_182613_000704
0       616     273     2402    10.9106,2539.22
0       667     345     3       4.09183,365.022
0       672     273     1268    13.7098,2659.86
0       726     273     758     12.6504,2686.83
0       780     273     3157    14.3474,2700.35
0       834     273     1992    16.2729,2677.13
1       888     281     3       10.8443,4811.55
... ...
 

2) the CompactLattice type

This is also in FSA, acceptor, format, with an arc format as:

[start state id] [end state id] [input & output symbol] [weight]

The input/output symbol are usually the word ids and the weight is:

[the graph cost],[the acoustic cost],[a string sequence of integers]

The integers represent the transition ids, i.e. the frame-alignment for this word symbol.

Example:
11031_A_20120617_182613_000704
0       20      2402    10.9106,2539.22,273_282_280_280_280_280_280_280_280_280_280_288_63704_63703_63712_63732_63731_63731_58806_58805_58805_58805_58805_58864_58863_58888_15286_15285_15285_15285_15285_15285_15285_15285_15285_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375
0       16      3       4.09183,365.022,345_354_352_352_352_352
0       9       1268    13.7098,2659.86,273_282_280_280_280_280_280_280_280_280_280_280_288_42724_42723_42736_42776_42775_3698_3697_3712_3711_3711_3711_3711_3711_3711_3762_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       5       758     12.6504,2686.83,273_282_280_280_280_280_280_280_280_280_280_280_288_64680_64702_64708_64707_64707_3680_3679_3679_3724_3723_3723_3762_3761_3761_3761_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273
0       3       3157    14.3474,2700.35,273_282_280_280_280_280_280_280_280_280_280_280_288_3186_3185_3185_3185_3185_3185_3226_3225_3225_3225_3225_3246_3245_3245_3245_15234_15233_15233_15233_15233_15233_15233_15314_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15313_15376_15375_15375_15375_15375_15375_15375_273

... ...

The conversion can be done using the  lattice-copy  with the option  --write-compact . Besides, in Kaldi, these two types of lattices are not distinguished for I/O purpose, that's to say the tools dealing with lattices can take in and output any of these two formats. However, typically, the CompactLattice is used as the default output format.

Many algorithms on lattices (for instance, taking the best path, or pruning) are most efficient to do using the Lattice type rather than the CompactLattice type.

In general, the words in Kaldi lattices are not synchronized with to the transition-ids, meaning that the transition-ids on an arc won't necessarily all belong to the word whose label is on that arc. This means the times you get from the lattice will be inexact. It is also true of the weights. To obtain the exact times, you should run the program lattice-align-words. It only works if the system is based on word-position-dependent phones and it requires certain command line options to tell which phones are in which position in the word. An alternative program, lattice-align-words-lexicon can be used if the system does not have word-position-dependent phones.

An HTK lattice has the exact time information. An example HTK lattice is:

VERSION=1.0
UTTERANCE=testf1.mfc
lmname=wdnet
lmscale=20.00 wdpenalty=-30.00
vocab=dict
N=31 L=56
I=0 t=0.00
I=1 t=0.36
I=2 t=0.75
I=3 t=0.81
... etc
I=30 t=2.48
J=0 S=0 E=1 W=SILENCE v=0 a=-3239.01 l=0.00
J=1 S=1 E=2 W=FOUR v=0 a=-3820.77 l=0.00
... etc
J=55 S=29 E=30 W=SILENCE v=0 a=-246.99 l=-1.20


The first 5 lines comprise a header which records names of the files used to generate the lattice along with the settings of the language model scale and penalty factors. Each node in the lattice represents a point in time measured in seconds and each arc represents a word spanning the segment of the input starting at the time of its start node and ending at the time of its end node. For each such span, v gives the number of the pronunciation used, a gives the acoustic score and l gives the language model score.

The language model scores in output lattices do not include the scale factors and penalties. These are removed so that the lattice can be used as a constraint network for subsequent recognizer testing.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值