Training Very Deep Networks: Derivation of the Formulas

This post discusses the optimization difficulties encountered when training very deep neural networks and the improvement proposed in the paper. By introducing two gating functions, T and C, an effective mapping between input and output is obtained; in particular, when T takes specific values the formulation simplifies and a residual connection emerges.

Original paper: Training Very Deep Networks

Authors: Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

Date: 22 Jul 2015

Most of the ideas in this post come from the paper, with some of my own interpretation added. It is purely a set of reading notes.


Consider a simple network made up of L hidden layers whose parameters we want to fit a function H. For convenience of the derivation, assume the input and output have the same dimensionality. The output y of a plain layer is then:

$$y = H(x, W_H) \tag{1}$$

Experiments show that solvers have difficulty optimizing this function directly, so we transform it appropriately by adding two new functions, T and C:

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) \tag{2}$$

Here T can be understood simply as scaling the original mapping H, i.e. the input-to-output mapping, up or down, while C carries the input x directly to the output y without passing it through H. In the paper's words, they express how much of the output is produced by transforming the input and how much by carrying it.

To make the expression more concise, we set C = 1 − T, which gives:

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot \big(1 - T(x, W_T)\big) \tag{3}$$

We observe that:

$$y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases} \tag{4}$$

When T = 0, the layer becomes an identity mapping from input to output; when T = 1, it reduces to the original y = H(x).
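To make equation (3) concrete, below is a minimal sketch of a single highway layer in PyTorch. This is my own illustration, not code from the paper; the layer names and the gate-bias value are assumptions. The transform gate T uses a sigmoid, and initializing its bias to a negative value biases T toward 0 so the layer starts close to the identity mapping, in the spirit of the paper's initialization.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))  (equation 3)."""
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.plain = nn.Linear(dim, dim)   # H(x, W_H)
        self.gate = nn.Linear(dim, dim)    # T(x, W_T)
        # Negative gate bias pushes T toward 0, so the layer starts near identity.
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.plain(x))      # nonlinear transform H
        t = torch.sigmoid(self.gate(x))    # transform gate T in (0, 1)
        return h * t + x * (1.0 - t)       # carry gate C = 1 - T

x = torch.randn(4, 64)
layer = HighwayLayer(64)
y = layer(x)
print(y.shape)  # torch.Size([4, 64])
```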

When T = 0.5 we get y = 0.5·H(x) + 0.5·x, i.e. 2y = x + H(x). Up to this constant scale factor of 2, writing F(x) for H(x), the output has exactly the residual form x + F(x), so the residual connection is essentially a special case of equation (3).
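A quick numeric check of this observation (my own illustration, with an arbitrary transform standing in for H):

```python
import torch

torch.manual_seed(0)
dim = 8
W = torch.randn(dim, dim)

def H(x):
    # Some nonlinear transform standing in for H(x, W_H).
    return torch.relu(x @ W)

x = torch.randn(3, dim)
t = 0.5                                  # gate fixed at T = 0.5
y = H(x) * t + x * (1 - t)               # equation (3)

# Up to the constant factor 2, y has the residual form x + F(x) with F = H.
print(torch.allclose(2 * y, x + H(x)))   # True
```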

Apart from the experiments, the paper does not give much theoretical justification for why this formulation is easier to optimize.

