Understanding LSTM Networks


Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural neural network architecture to use for such data.
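
To make the unrolling concrete, here is a minimal Python/NumPy sketch of a vanilla RNN cell applied across a short sequence. The weight names and sizes are assumptions made for illustration, not taken from any particular library.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One application of the repeating module: a single tanh layer that
    combines the previous hidden state with the current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Illustrative sizes and random weights (assumptions for this sketch only).
hidden_dim, input_dim, seq_len = 4, 3, 5
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_xh = rng.normal(size=(hidden_dim, input_dim))
b_h = np.zeros(hidden_dim)

# "Unrolling the loop": the same module (same weights) runs at every time
# step, each step handing its hidden state to its successor.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
```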

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

Neural networks struggle with long term dependencies.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
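
One toy illustration of the difficulty (a sketch of the intuition, not an argument taken from the cited papers): the gradient connecting a late time step back to an early one is a product of one Jacobian-like factor per step, and such products tend to shrink or blow up exponentially as the gap grows. With assumed random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, gap = 16, 50

# A toy stand-in for the per-step Jacobian of a tanh RNN: recurrent weights
# scaled so each factor tends to shrink the signal slightly.
J = 0.9 * rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)

grad = np.eye(hidden_dim)
for _ in range(gap):
    grad = J @ grad  # the chain rule multiplies in one factor per time step

# The norm decays roughly exponentially with the size of the gap, which is
# why the learning signal from far-away context is so weak for plain RNNs.
print(np.linalg.norm(grad))
```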

Thankfully, LSTMs don’t have this problem!

LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work.1 They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.
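
Written as an equation, with $W$ and $b$ the learned parameters and $[h_{t-1}, x_t]$ the concatenation of the previous output and the current input, that single layer is just:

$$h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)$$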

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

An LSTM neural network.
The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
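
As a minimal NumPy sketch (weight names assumed for illustration), a gate is nothing more than an elementwise multiplication by a sigmoid layer’s output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(h_prev, x_t, W, b, candidate):
    """Optionally let `candidate` through, component by component.

    Each component of the sigmoid output is between 0 ("let nothing
    through") and 1 ("let everything through").
    """
    g = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)
    return g * candidate
```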

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.”
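
In equation form (with $\sigma$ denoting the sigmoid function), the forget gate computes:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$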

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
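
In equation form, these two parts are:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$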

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.
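
As an equation, the new cell state is:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$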

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
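
In equation form:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$

$$h_t = o_t * \tanh\left(C_t\right)$$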

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
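
Putting the whole walkthrough together, here is a minimal NumPy sketch of one LSTM step. Weight shapes and names are assumptions for illustration; a real implementation would also batch inputs and typically fuse the four layers into a single matrix multiply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One step of the LSTM walkthrough above.

    `params` holds a (W, b) pair for each of the forget, input, candidate,
    and output layers; every W acts on the concatenation [h_prev, x_t].
    """
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate values
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate

    C_t = f_t * C_prev + i_t * C_tilde   # drop old info, add new info
    h_t = o_t * np.tanh(C_t)             # output a filtered cell state
    return h_t, C_t
```

Running this in a loop over a sequence, carrying $h_t$ and $C_t$ forward from step to step, gives exactly the chain of repeating modules pictured above.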

Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
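
With peepholes on every gate, the gate equations also take the cell state as an input:

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)$$

$$i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right)$$

$$o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)$$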

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
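
In equation form, the coupled update replaces the separate input gate with $1 - f_t$:

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$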

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

A gated recurrent unit neural network.
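
In equation form, the GRU’s update gate $z_t$, reset gate $r_t$, and hidden state are:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$

$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$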

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.


  1. In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.
