RNNS ARE NOT TRANSFORMERS (YET): The Key Bottleneck on In-context Retrieval (Translation)

Original paper: https://arxiv.org/pdf/2402.18510

RNNS ARE NOT TRANSFORMERS (YET): The Key Bottleneck on In-context Retrieval

Kaiyue Wen${}^{1*}$  Xingyu Dang${}^{1*}$  Kaifeng Lyu${}^{2\dagger}$

${}^{1}$ Institute for Interdisciplinary Information Sciences, Tsinghua University

${}^{2}$ Department of Computer Science & Princeton Language and Intelligence, Princeton University

{wenky20,dangxy20}@mails.tsinghua.edu.cn

klyu@cs.princeton.edu

ABSTRACT

This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can close this representation gap with Transformers.
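To make the two separating tasks named above concrete, here is a minimal sketch (not code from the paper; the instance format, function names, and parameters are assumptions) of how associative-recall prompts and is-tree instances can be generated and checked offline. In-context, a model only sees such instances as token sequences, and the claim above is that a fixed-size recurrent state cannot reliably retrieve the relevant key-value pair or edge as the context grows, while attention can.

```python
import random

def associative_recall_instance(num_pairs=8, vocab=tuple("abcdefghijklmnop")):
    """Build one associative-recall prompt: key-value pairs followed by a
    query key; the correct output is the value paired with that key.
    (Illustrative format only; the paper's exact tokenization may differ.)"""
    keys = random.sample(vocab, num_pairs)          # distinct keys
    values = [random.choice(vocab) for _ in keys]   # values may repeat
    query = random.choice(keys)
    prompt = " ".join(f"{k} {v}" for k, v in zip(keys, values)) + f" | {query}"
    answer = values[keys.index(query)]
    return prompt, answer

def is_tree(num_nodes, edges):
    """Decide whether an undirected graph is a tree: it must have exactly
    n - 1 edges and contain no cycle (checked with union-find)."""
    if len(edges) != num_nodes - 1:
        return False
    parent = list(range(num_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # adding this edge would close a cycle
        parent[ru] = rv
    return True

if __name__ == "__main__":
    prompt, answer = associative_recall_instance()
    print(prompt, "->", answer)
    print(is_tree(4, [(0, 1), (1, 2), (2, 3)]))  # True: a path is a tree
    print(is_tree(4, [(0, 1), (1, 2), (2, 0)]))  # False: contains a cycle
```

The union-find check makes explicit the kind of global bookkeeping over the whole edge list that the is-tree task requires; the paper's question is whether a model can perform the equivalent retrieval purely from its internal state and CoT tokens.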
