CMU 11-785 L15 Divergence of RNN

This post covers the different RNN architectures: one-to-one, many-to-many, and sequence-to-sequence. It focuses on how words are represented for language modelling and how RNNs are trained in each setting. In the many-to-many case, the divergence is computed between the network's output sequence and the desired output sequence. In the sequence-to-sequence case, deciding when to output a symbol becomes the challenge, and two options are discussed: selecting the most probable symbol at each time, or imposing external constraints on the allowed output sequences.


Variants on recurrent nets

  • Architectures
    • How to train recurrent networks of different architectures
  • Synchrony
    • The target output is time-synchronous with the input
    • The target output is order-synchronous, but not time-synchronous

One to one


  • No recurrence in model

    • Exactly as many outputs as inputs
    • One-to-one correspondence between desired output and actual output
  • Common assumption
    $\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(1 \ldots T), Y(1 \ldots T)\right) = w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(t), Y(t)\right)$

    • $w_t$ is typically set to 1.0 (see the sketch below)
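
A minimal sketch of this assumption (PyTorch assumed; names are illustrative): the divergence of the whole output sequence decomposes into per-time divergences, each scaled by its weight $w_t$.

```python
import torch
import torch.nn.functional as F

def sequence_divergence(logits, targets, w=None):
    """logits: (T, C) per-time outputs; targets: (T,) class indices; w: (T,) weights."""
    T = logits.shape[0]
    if w is None:
        w = torch.ones(T)                       # common assumption: w_t = 1.0
    per_step = F.cross_entropy(logits, targets, reduction='none')   # (T,)
    return (w * per_step).sum()
```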

Many to many


  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
  • This is not just the sum of the divergences at individual times
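
A minimal sketch of the distinction (PyTorch assumed; the divergence below is a contrived illustration, not the lecture's): a genuinely sequence-level divergence couples time steps, yet autograd still yields a per-step gradient $\nabla_{Y(t)}$ for backpropagation through time.

```python
import torch

T, C = 5, 4
Y = torch.randn(T, C, requires_grad=True)   # stand-in for the network's outputs
Y_target = torch.randn(T, C)

# This divergence couples time steps via cumulative sums, so it is NOT a sum
# of independent per-time divergences (purely illustrative choice).
div = ((Y.softmax(1).cumsum(0) - Y_target.softmax(1).cumsum(0)) ** 2).sum()
div.backward()
print(Y.grad.shape)                          # (T, C): one gradient per time step
```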

Language modelling: Representing words

  • Represent words as one-hot vectors

    • One-hot representations are very high-dimensional and sparse
    • Makes no assumptions about the relative importance of words
  • Projected word vectors

    • Replace every one-hot vector $W_i$ by $PW_i$
    • $P$ is an $M \times N$ matrix (see the first sketch after this list)
  • How to learn projections


  • Soft bag of words
    • Predict a word based on the words in its immediate context
    • Without considering their specific positions
  • Skip-grams
    • Predict adjacent words based on the current word (see the second sketch below)
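
Two sketches follow. First (PyTorch assumed; dimensions are illustrative), the projection itself: multiplying a one-hot vector $W_i$ by the $M \times N$ matrix $P$ simply selects the $i$-th column of $P$, which is why it is implemented as an embedding-table lookup.

```python
import torch
import torch.nn as nn

N, M = 10000, 128                  # vocabulary size, projected dimension
embed = nn.Embedding(N, M)         # rows of embed.weight play the role of P^T

i = torch.tensor([42])             # a word index instead of a one-hot vector
p_wi = embed(i)                    # (1, M): same as P @ one_hot(i)

# Equivalence check against the explicit one-hot product:
one_hot = torch.zeros(N)
one_hot[42] = 1.0
assert torch.allclose(p_wi.squeeze(0), embed.weight.t() @ one_hot)
```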

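Second, a sketch of skip-gram training (PyTorch assumed; names and hyperparameters are illustrative): predicting each adjacent word from the current word through the shared projection learns $P$ as a side effect. A soft bag-of-words variant would instead average the projections of the surrounding context words.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, M = 10000, 128
embed = nn.Embedding(N, M)          # the projection P being learned
out = nn.Linear(M, N)               # softmax layer over the vocabulary
opt = torch.optim.SGD(list(embed.parameters()) + list(out.parameters()), lr=0.1)

def skipgram_step(center, context):
    """center: (B,) current-word ids; context: (B,) ids of adjacent words."""
    logits = out(embed(center))     # (B, N) scores for the adjacent word
    loss = F.cross_entropy(logits, context)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```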

Many to one

  • Example
    • Question answering
      • Input : Sequence of words
      • Output: Answer at the end of the question
    • Speech recognition
      • Input : Sequence of feature vectors (e.g. Mel spectra)
      • Output: Phoneme ID at the end of the sequence


  • Outputs are actually produced for every input

    • We only read the output at the end of the sequence
  • How to train

    • Define the divergence everywhere
      • $DIV\left(Y_{\text{target}}, Y\right) = \sum_{t} w_{t} \operatorname{Xent}(Y(t), \text{Phoneme})$
    • Typical weighting scheme for speech
      • All time steps are equally important
    • For problems like question answering
      • The answer is only expected after the question ends (see the sketch after this list)
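
A minimal sketch of this weighted divergence (PyTorch assumed; names are illustrative): the same expression covers both cases, differing only in the choice of $w_t$.

```python
import torch
import torch.nn.functional as F

def many_to_one_div(logits, target, w):
    """logits: (T, C) per-time outputs; target: 0-dim class id; w: (T,) weights."""
    T = logits.shape[0]
    per_step = F.cross_entropy(logits, target.expand(T), reduction='none')
    return (w * per_step).sum()

T, C = 20, 40
logits = torch.randn(T, C)
target = torch.tensor(7)                 # the single desired label

w_speech = torch.ones(T)                 # speech: all time steps equally important
w_qa = torch.zeros(T)                    # question answering: only the final
w_qa[-1] = 1.0                           # output, after the question ends
loss = many_to_one_div(logits, target, w_qa)
```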

Sequence-to-sequence


  • How do we know when to output symbols?
    • In fact, the network produces outputs at every time step
    • Which of these are the real outputs?
      • Outputs that represent the definitive occurrence of a symbol


  • Option 1: Simply select the most probable symbol at each time (see the sketch after this list)
    • Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • The resulting sequence may be meaningless
  • Option 2: Impose external constraints on what sequences are allowed
    • Only allow sequences corresponding to dictionary words
    • Sub-symbol units
  • How to train when no timing information is provided
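
A minimal sketch of Option 1's merge-and-select decoding (PyTorch assumed; names are illustrative): take the most probable symbol at each time and collapse runs of the same symbol. As noted above, this cannot tell one extended symbol from genuine repetitions.

```python
import torch

def greedy_decode(probs):
    """probs: (T, S) per-time output distributions over S symbols."""
    best = probs.argmax(dim=1).tolist()   # most probable symbol at each time
    merged = [best[0]]
    for s in best[1:]:
        if s != merged[-1]:               # merge adjacent repeated symbols
            merged.append(s)
    return merged
```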


  • Only the sequence of output symbols is provided for the training data
    • But no indication of which one occurs where
  • How do we compute the divergence?
    • And how do we compute its gradient?