CMU 11-785 L16 Connectionist Temporal Classification

This post covers CTC (Connectionist Temporal Classification), an order-synchronous but time-asynchronous approach to sequence-to-sequence learning: the input and output sequences need not be aligned in time. It walks through the options for decoding (picking the most probable symbol at each time step, imposing external constraints), how to estimate the best alignment when the training data carries no timing information, and how to compute alignment probabilities with the forward and backward algorithms. Problems that arise in CTC training, such as the ambiguity between a repeated symbol and an extended one, are also addressed.

Sequence to sequence

  • Sequence goes in, sequence comes out
  • No notion of “time synchrony” between input and output
    • May not even maintain the order of symbols (e.g., from one language to another)

With order synchrony


  • The input and output sequences happen in the same order
    • Although they may be time asynchronous
    • E.g. Speech recognition
      • The input speech corresponds to the phoneme sequence output
  • Question
    • How do we know when to output symbols
    • In fact, the network produces outputs at every time
    • Which of these are the real outputs?

Option 1


  • Simply select the most probable symbol at each time
  • Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
  • Problem
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • Resulting sequence may be meaningless
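
A minimal sketch of this greedy decode, assuming `probs` is a (T, K) array of per-frame symbol probabilities and `symbols` maps column indices to symbol names (both names are mine, not the lecture's):

```python
import numpy as np
from itertools import groupby

def greedy_decode(probs, symbols):
    """Pick the most probable symbol at each time, then merge
    adjacent repeats: a a b b c -> a b c."""
    best = np.argmax(probs, axis=1)                # per-frame argmax
    return [symbols[i] for i, _ in groupby(best)]  # merge adjacent repeats

# Hypothetical 5-frame, 3-symbol posterior table.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.2, 0.7]])
print(greedy_decode(probs, ["a", "b", "c"]))       # ['a', 'b', 'c']
```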

Option 2

  • Impose external constraints on what sequences are allowed
    • E.g. only allow sequences corresponding to dictionary words

Decoding

  • The process of obtaining an output from the network

  • Time-synchronous vs. order-synchronous sequences

    • aaabbbbbbccc => abc (probability 0.5)
      aabbbbbbbccc => abc (probability 0.001)
      cccddddddeee => cde (probability 0.4)
      cccddddeeeee => cde (probability 0.4)
      
    • So abc is the most likely time-synchronous output sequence
    • But cde is the most likely order-synchronous sequence (0.4 + 0.4 = 0.8 vs. 0.5 + 0.001 = 0.501)
  • Option 2 is in fact a suboptimal decode: it finds the most likely time-synchronous output sequence, not the most likely order-synchronous one

  • The “merging” heuristics do not guarantee optimal order-synchronous sequences
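
To see the gap concretely, the toy script below uses the hypothetical probabilities from the example above: the single best alignment collapses to abc, but summing probability over all alignments that collapse to the same string makes cde the winner.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical per-alignment probabilities from the example above.
alignments = {
    "aaabbbbbbccc": 0.5,
    "aabbbbbbbccc": 0.001,
    "cccddddddeee": 0.4,
    "cccddddeeeee": 0.4,
}

def collapse(a):
    """Merge adjacent repeated symbols: 'aaabbbccc' -> 'abc'."""
    return "".join(sym for sym, _ in groupby(a))

# Most likely time-synchronous decode: the single best alignment.
print(collapse(max(alignments, key=alignments.get)))   # abc (p = 0.5)

# Most likely order-synchronous decode: sum over all alignments
# that collapse to the same output string.
totals = defaultdict(float)
for a, p in alignments.items():
    totals[collapse(a)] += p
print(max(totals, key=totals.get))                     # cde (p = 0.8)
```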

No timing information provided


  • Only the sequence of output symbols is provided for the training data
    • But no indication of which one occurs where

Guess the alignment

  • Initialize
    • Assign an initial alignment
    • Either randomly, based on some heuristic, or any other rationale
  • Iterate
    • Train the network using the current alignment
    • Reestimate the alignment for each training instance
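
A skeleton of this loop, with hypothetical callables standing in for the pieces the lecture leaves abstract: `train_step` runs one training pass against frame-level targets, and `realign` re-estimates an alignment (e.g., the constrained Viterbi search sketched later).

```python
def uniform_alignment(n_frames, target):
    """Heuristic initial alignment: spread the target symbols
    evenly across the frames."""
    return [target[min(i * len(target) // n_frames, len(target) - 1)]
            for i in range(n_frames)]

def train_with_realignment(model, data, train_step, realign, n_iters=10):
    """data: list of (features, target) pairs.
    train_step(model, features, alignment): one training pass using the
        alignment as frame-level targets.
    realign(model, features, target): re-estimate the best alignment."""
    # Initialize: assign an initial alignment to every instance.
    aligns = [uniform_alignment(len(x), tgt) for x, tgt in data]
    for _ in range(n_iters):
        # Train the network using the current alignments...
        for (x, tgt), a in zip(data, aligns):
            train_step(model, x, a)
        # ...then re-estimate the alignment of each training instance.
        aligns = [realign(model, x, tgt) for x, tgt in data]
    return model
```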

Constraining the alignment


  • Try 1
    • Block out all rows that do not include symbols from the target sequence
    • E.g. Block out rows that are not /B/ /IY/ or /F/
  • Only decode on reduced grid
    • We are now assured that only the appropriate symbols will be hypothesized
  • But this still doesn’t assure that the decode sequence correctly expands the target symbol sequence
    • The symbols may still be hypothesized in the wrong order
  • Try 2
    • Explicitly arrange the constructed table
    • Arrange the constructed table so that from top to bottom it has the exact sequence of symbols required
      • If a symbol occurs multiple times, we repeat the row in the appropriate location


  • Constrain that the first symbol in the decode must be the top left block

    • The last symbol must be the bottom right
    • The rest of the symbols must follow a sequence that monotonically travels down from top left to bottom right
    • This guarantees that the sequence is an expansion of the target sequence
  • Compose a graph such that every path in the graph from source to sink represents a valid alignment

    • The “score” of a path is the product of the probabilities of all nodes along the path
    • Find the most probable path from source to sink using dynamic programming (the Viterbi algorithm)

Viterbi algorithm

  • Main idea

    • The best path to any node must be an extension of the best path to one of its parent nodes
  • Dynamically track the best path (and the score of the best path) from the source node to every node in the graph

    • At each node, keep track of
      • The best incoming parent edge (BP)
      • The score of the best path from the source to the node through this best parent edge (Bscr)
  • Process and algorithm (trellis figures omitted; a sketch in code follows)
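
A minimal sketch of this constrained Viterbi search, assuming `probs` is a (T, K) array of per-frame symbol probabilities and `seq` lists the expanded target's symbol indices, one per trellis row (names are assumptions, not the lecture's):

```python
import numpy as np

def viterbi_align(probs, seq):
    """Best monotonic path through the constrained trellis.
    probs: (T, K) per-frame symbol probabilities (assumed > 0).
    seq: expanded target, one symbol index per trellis row.
    Returns the trellis row occupied at each frame."""
    T, R = probs.shape[0], len(seq)
    bscr = np.full((T, R), -np.inf)   # Bscr: best log-score into each node
    bp = np.zeros((T, R), dtype=int)  # BP: best incoming parent row

    # The first symbol must be the top-left node.
    bscr[0, 0] = np.log(probs[0, seq[0]])
    for t in range(1, T):
        for r in range(R):
            # Parents: stay in row r, or advance from row r-1 (monotone down).
            parents = [r] if r == 0 else [r, r - 1]
            best = max(parents, key=lambda p: bscr[t - 1, p])
            bp[t, r] = best
            bscr[t, r] = bscr[t - 1, best] + np.log(probs[t, seq[r]])

    # The last symbol must be the bottom-right node; backtrace from there.
    path = [R - 1]
    for t in range(T - 1, 0, -1):
        path.append(bp[t, path[-1]])
    return path[::-1]
```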

Gradients

$DIV = \sum_{t} \text{Xent}\left(Y_{t}, \text{symbol}_{t}^{\text{best path}}\right) = -\sum_{t} \log Y\left(t, \text{symbol}_{t}^{\text{best path}}\right)$

  • The gradient w.r.t. the $t$-th output vector $Y_t$ is zero everywhere except at the best-path symbol:
    $\nabla_{Y_{t}} DIV = \left[0 \quad 0 \;\cdots\; \frac{-1}{Y(t,\, \text{symbol}_{t}^{\text{best path}})} \;\cdots\; 0\right]$
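
In code, assuming `Y` is a (T, K) array of output probabilities and `best_path` gives the best-path symbol index at each frame (hypothetical names):

```python
import numpy as np

def bestpath_divergence(Y, best_path):
    """DIV = -sum_t log Y(t, symbol_t^bestpath), and its gradient:
    zero everywhere except the best-path symbol of each frame."""
    t = np.arange(Y.shape[0])
    div = -np.sum(np.log(Y[t, best_path]))
    grad = np.zeros_like(Y)
    grad[t, best_path] = -1.0 / Y[t, best_path]
    return div, grad
```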


  • Problem

    • Approach heavily dependent on initial alignment
    • Prone to poor local optima
      • Because we commit to the single “best” estimated alignment
      • This can be way off, particularly in early iterations, or if the model is poorly initialized
  • Alternate view

    • There is a probability distribution over alignments of the target symbol sequence (to the input)
    • Selecting a single alignment is the same as drawing a single sample from it
  • Instead of only selecting the most likely alignment, use the statistical expectation over all possible alignments

    • $DIV = E\left[-\sum_{t} \log Y\left(t, s_{t}\right)\right]$

    • Use the entire distribution of alignments

    • This will mitigate the issue of suboptimal selection of alignment

  • Using the linearity of expectation

    • $DIV = -\sum_{t} E\left[\log Y\left(t, s_{t}\right)\right]$

    • $DIV = -\sum_{t} \sum_{S \in S_{1} \ldots S_{K}} P\left(s_{t}=S \mid \mathbf{S}, \mathbf{X}\right) \log Y\left(t, s_{t}=S\right)$

A posteriori probabilities of symbols

  • $P(s_{t}=S \mid \mathbf{S}, \mathbf{X})$ is the a posteriori probability of seeing the specific symbol $S$ at time $t$, given that the output is the target symbol sequence $\mathbf{S}$ for input $\mathbf{X}$; it is computed with the forward-backward algorithm (a sketch follows)
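
A minimal forward-backward sketch over the same constrained trellis as the Viterbi example (no blank symbol, matching the simplified formulation here; a real implementation would work in log space to avoid underflow):

```python
import numpy as np

def alignment_posteriors(probs, seq):
    """gamma[t, r] = P(s_t = seq[r] | S, X) over the constrained trellis.
    probs: (T, K) per-frame symbol probabilities.
    seq: expanded target, one symbol index per trellis row."""
    T, R = probs.shape[0], len(seq)
    alpha = np.zeros((T, R))  # forward scores
    beta = np.zeros((T, R))   # backward scores

    # Forward: paths must start at the top-left node.
    alpha[0, 0] = probs[0, seq[0]]
    for t in range(1, T):
        for r in range(R):
            incoming = alpha[t - 1, r] + (alpha[t - 1, r - 1] if r > 0 else 0.0)
            alpha[t, r] = incoming * probs[t, seq[r]]

    # Backward: paths must end at the bottom-right node.
    beta[T - 1, R - 1] = 1.0
    for t in range(T - 2, -1, -1):
        for r in range(R):
            out = beta[t + 1, r] * probs[t + 1, seq[r]]
            if r + 1 < R:
                out += beta[t + 1, r + 1] * probs[t + 1, seq[r + 1]]
            beta[t, r] = out

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize per frame
```

Normalizing $\alpha_t(r)\,\beta_t(r)$ across rows at each frame yields exactly the posterior weights used in the expected divergence above.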
