Variants on recurrent nets
- Architectures
- How to train recurrent networks of different architectures
- Synchrony
- The target output is time-synchronous with the input
- The target output is order-synchronous, but not time synchronous
One to one
- No recurrence in model
- Exactly as many outputs as inputs
- One to one correspondence between desired output and actual output
- Common assumption
- $\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(1 \ldots T), Y(1 \ldots T)\right) = w_t \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(t), Y(t)\right)$
- $w_t$ is typically set to 1.0
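A minimal numpy sketch of this assumption (the squared-error divergence and all names below are illustrative, not from the lecture): with a per-time divergence weighted by $w_t$, the gradient with respect to $Y(t)$ depends only on the target and output at time $t$.

```python
import numpy as np

# Illustrative sketch: total divergence = sum_t w_t * Div(Y_target(t), Y(t)),
# so the gradient w.r.t. Y(t) is just w_t times the per-time gradient at t.

def sequence_divergence(Y, Y_target, w):
    # Y, Y_target: (T, d) outputs and targets; w: (T,) per-time weights
    per_time = 0.5 * np.sum((Y - Y_target) ** 2, axis=1)   # L2 divergence at each t
    return np.sum(w * per_time)

def gradient_wrt_Y(Y, Y_target, w):
    # d Div / d Y(t) = w_t * (Y(t) - Y_target(t)) for the L2 divergence above
    return w[:, None] * (Y - Y_target)

T, d = 5, 3
rng = np.random.default_rng(0)
Y, Y_target = rng.normal(size=(T, d)), rng.normal(size=(T, d))
w = np.ones(T)                      # w_t is typically set to 1.0
print(sequence_divergence(Y, Y_target, w), gradient_wrt_Y(Y, Y_target, w).shape)
```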
Many to many
- The divergence is computed between the sequence of outputs produced by the network and the desired output sequence
- This is not just the sum of the divergences at individual times
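To make this concrete, here is a small illustrative example (not from the lecture) of a sequence-level divergence that does not decompose into per-time terms: the edit distance between the output symbol sequence and the target sequence.

```python
def edit_distance(hyp, ref):
    # Levenshtein distance: minimum number of insertions, deletions and
    # substitutions needed to turn hyp into ref.
    D = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        D[i][0] = i
    for j in range(len(ref) + 1):
        D[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = D[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[-1][-1]

# The two sequences need not even have the same length,
# so a per-time sum of divergences is not defined.
print(edit_distance(["/B/", "/AH/", "/T/"], ["/B/", "/AH/", "/T/", "/S/"]))  # 1
```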
Language modelling: Representing words
- Represent words as one-hot vectors
- Problem: these vectors are very high-dimensional and sparse
- Makes no assumptions about the relative importance of words
- Projected word vectors
- Replace every one-hot vector $W_i$ by $PW_i$ (a small sketch follows this list)
- $P$ is an $M \times N$ matrix
- How to learn projections
- Soft bag of words
- Predict word based on words in immediate context
- Without considering specific position
- Skip-grams
- Predict adjacent words based on current word
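A small numpy sketch of the projection step above (vocabulary size, embedding size, and the random $P$ are placeholders; in practice $P$ is learned, e.g. via the soft bag-of-words or skip-gram objectives):

```python
import numpy as np

# Illustrative sizes: a vocabulary of N words projected down to M dimensions.
N, M = 10000, 300
rng = np.random.default_rng(0)
P = rng.normal(size=(M, N))          # projection matrix (learned in practice)

def one_hot(i, N):
    v = np.zeros(N)
    v[i] = 1.0
    return v

w_i = one_hot(42, N)                 # one-hot vector for word 42
projected = P @ w_i                  # P W_i: the M-dimensional word vector
# Since w_i is one-hot, P @ w_i is simply column 42 of P (an embedding lookup).
assert np.allclose(projected, P[:, 42])
```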
Many to one
- Example
- Question answering
- Input: Sequence of words
- Output: Answer at the end of the question
- Speech recognition
- Input: Sequence of feature vectors (e.g. Mel spectra)
- Output: Phoneme ID at the end of the sequence
- Question answering
- Outputs are actually produced for every input
- We only read it at the end of the sequence
- How to train
- Define the divergence everywhere (a sketch follows this list)
- $\mathrm{DIV}\left(Y_{\text{target}}, Y\right) = \sum_t w_t \operatorname{Xent}(Y(t), \text{Phoneme})$
- Typical weighting scheme for speech
- All are equally important
- For problems like question answering
- The answer is only expected after the question ends
- The divergence can still be defined everywhere, with the weights $w_t$ emphasizing the outputs after the question ends
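A hedged numpy sketch of this weighted divergence (shapes and values are made up for illustration): the same per-time cross-entropy is used in both cases; only the weights $w_t$ change.

```python
import numpy as np

# The divergence is defined at every time step as Xent(Y(t), Phoneme); the weights
# w_t encode what matters: everything for speech, only the end for question answering.

def weighted_divergence(Y, target_id, w):
    # Y: (T, C) softmax outputs over C symbols; target_id: the desired symbol;
    # w: (T,) per-time weights
    xent = -np.log(Y[:, target_id] + 1e-12)   # Xent(Y(t), Phoneme) at each t
    return np.sum(w * xent)

T, C = 8, 5
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, C))
Y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
phoneme = 3                                   # target symbol for this sequence

w_speech = np.ones(T)                         # all time steps equally important
w_qa = np.zeros(T); w_qa[-1] = 1.0            # only the final output is scored
print(weighted_divergence(Y, phoneme, w_speech),
      weighted_divergence(Y, phoneme, w_qa))
```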
Sequence-to-sequence
- How do we know when to output symbols?
- In fact, the network produces outputs at every time
- Which of these are the real outputs?
- Outputs that represent the definitive occurrence of a symbol
- Option 1: Simply select the most probable symbol at each time (see the decoding sketch at the end of this section)
- Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
- Cannot distinguish between an extended symbol and repetitions of the symbol
- Resulting sequence may be meaningless
- Option 2: Impose external constraints on what sequences are allowed
- Only allow sequences corresponding to dictionary words
- Sub-symbol units
- How to train when no timing information is provided
- Only the sequence of output symbols is provided for the training data
- But no indication of which one occurs where
- How do we compute the divergence?
- And how do we compute its gradient?
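A small numpy sketch of Option 1 above (the probability table and symbol names are invented for illustration): pick the most probable symbol at each time, then merge adjacent repetitions.

```python
import numpy as np

# Greedy decoding: most probable symbol per time step, then merge adjacent repeats.
# As noted above, this cannot tell an extended symbol apart from a real repetition.

def greedy_decode(Y, symbols):
    # Y: (T, C) per-time symbol probabilities; symbols: list of C symbol names
    best = np.argmax(Y, axis=1)                 # most probable symbol per time
    merged = [best[0]]
    for s in best[1:]:
        if s != merged[-1]:                     # merge adjacent repeats
            merged.append(s)
    return [symbols[s] for s in merged]

Y = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
print(greedy_decode(Y, ["/AH/", "/B/", "/T/"]))   # ['/AH/', '/B/', '/T/']
```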
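One standard answer to the last two questions is connectionist temporal classification (CTC), which defines the divergence by summing over all alignments of the target symbol sequence to the $T$ per-time outputs. The sketch below only illustrates the idea using PyTorch's nn.CTCLoss; the tensor sizes are placeholders, and this is not presented as the lecture's own derivation.

```python
import torch
import torch.nn as nn

# Hedged sketch: when only the output symbol sequence is known (no timing),
# the CTC divergence marginalizes over all alignments of the target sequence
# to the T per-time outputs. All sizes below are placeholders.

T, N, C = 50, 1, 20                                  # time steps, batch, symbols (index 0 = blank)
logits = torch.randn(T, N, C, requires_grad=True)    # stand-in for the network's outputs
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # symbol sequence, no timing
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()        # gradient w.r.t. the per-time outputs, computed by the
                       # forward-backward recursion over alignments
print(loss.item())
```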