What matters when building vision-language models?

This article takes an in-depth look at the factors that matter when building vision-language models, studying pre-trained models, architecture, data, and training methods through extensive experiments. The findings led to the development of Idefics2, an efficient 8-billion-parameter foundation VLM that delivers strong performance and high inference efficiency.

This post is part of the LLM series of articles and is a translation of "What matters when building vision-language models?".

Abstract

Improvements in large language models and vision transformers have driven growing interest in vision-language models (VLMs). Despite the abundance of literature on this subject, we observe that critical decisions regarding VLM design are often unjustified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choices, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the models (base, instructed, and chat) along with the datasets created for their training.

1 Introduction

2 Terminology

3 Exploring the design space of vision-language models

4 Idefics2 - an open state-of-the-art vision-language foundation model

5 Conclusion

In this work, we re-examine common choices made in the VLM literature and rigorously compare them in controlled experiments. Our findings concern the effectiveness of different architectures, their performance/inference-cost trade-offs, and training stability. With this knowledge, we train Idefics2, an open 8B-parameter vision-language model. Idefics2 is state-of-the-art across various benchmarks in its size category while being more efficient at inference.

### Key Components and Concepts in the Transformer Architecture

Transformers rely heavily on self-attention, a core part of their design[^1]. Self-attention lets the model weigh the significance of each word in a sentence relative to every other word, enabling effective processing of sequential data without being constrained by fixed-length context windows.

The architecture also incorporates multi-head attention layers, which allow the model to jointly attend to information from different representation subspaces at different positions. Each head learns distinct patterns, leading to richer representations overall.

Positional encodings are added to the input embeddings because the self-attention layer does not inherently capture positional relationships between tokens. These encodings supply the ordering information the network needs to understand where each element appears relative to the others.

Normalization techniques such as layer normalization also play an essential role: they stabilize training dynamics across multiple stacked transformer blocks, ensuring consistent behavior throughout deep networks. Feed-forward networks follow the normalization steps, providing the non-linearity required for learning complex mappings between inputs and outputs.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        # Multi-head attention using PyTorch's built-in nn.MultiheadAttention
        self.attention = nn.MultiheadAttention(embed_size, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask=None):
        # nn.MultiheadAttention returns (output, attention weights)
        attention, _ = self.attention(query, key, value, attn_mask=mask)
        # Add & Norm: residual connection around the attention sub-layer
        x = self.dropout(self.norm1(attention + query))
        # Feed-forward sub-layer with its own residual connection
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out
```
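The block above assumes the token embeddings already carry order information; since self-attention itself is permutation-invariant, a positional encoding has to be added to the embeddings first. The sketch below illustrates the classic sinusoidal scheme and feeds the result through the `TransformerBlock` defined above. The `PositionalEncoding` class name, shapes, and hyperparameters are illustrative assumptions, not something taken from the paper discussed in this post.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (illustrative sketch)."""
    def __init__(self, embed_size, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
        div_term = torch.exp(torch.arange(0, embed_size, 2) * (-math.log(10000.0) / embed_size))
        pe = torch.zeros(max_len, embed_size)
        pe[:, 0::2] = torch.sin(position * div_term)            # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)            # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))             # (1, max_len, embed_size)

    def forward(self, x):
        # x: (batch, seq_len, embed_size) -> add the matching slice of encodings
        return x + self.pe[:, : x.size(1)]

# Usage sketch: embed a dummy batch, add positions, run the TransformerBlock above.
embed_size, heads, seq_len, batch = 256, 8, 16, 2
pos_enc = PositionalEncoding(embed_size)
block = TransformerBlock(embed_size, heads, dropout=0.1, forward_expansion=4)
x = pos_enc(torch.randn(batch, seq_len, embed_size))
out = block(x, x, x, mask=None)   # self-attention: value, key, and query are the same tensor
print(out.shape)                  # torch.Size([2, 16, 256])
```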