NetVLAD: CNN Architecture for Weakly Supervised Place Recognition


Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297-5307

https://ieeexplore.ieee.org/document/7937898

Abstract:

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following four principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture's parameters from images depicting the same places over time downloaded from Google Street View Time Machine. Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks. Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.

Section 1: Introduction
Visual place recognition has received a significant amount of attention in the past years both in computer vision [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and robotics communities [12], [13], [14], [15], [16], motivated by, e.g., applications in autonomous driving [14], augmented reality [17] or geo-localizing archival imagery [18]. The place recognition problem, however, still remains extremely challenging. How can we recognize the same street-corner in the entire city or on the scale of the entire country despite the fact it can be captured in different illuminations or change its appearance over time? The fundamental scientific question is what is the appropriate representation of a place that is rich enough to distinguish similarly looking places yet compact enough to represent entire cities or countries.
The place recognition problem has been traditionally cast as an instance retrieval task, where the query image location is estimated using the locations of the most visually similar images obtained by querying a large geotagged database [1], [2], [3], [8], [9], [10]. Each database image is represented using local invariant features [19] such as SIFT [20] that are aggregated into a fixed length vector representation for the entire image such as bag-of-visual-words [21], [22], VLAD [23], [24] or Fisher vector [25], [26]. The resulting representation is then usually compressed and efficiently indexed [21], [27]. The image database can be further augmented by 3D structure that enables recovery of accurate camera pose [4], [11], [28].
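For reference, the classic VLAD aggregation step can be sketched in a few lines of Python. This is an illustrative sketch with hard cluster assignment, not the paper's implementation; NetVLAD, introduced below, replaces the non-differentiable hard assignment with a soft one:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Classic hard-assignment VLAD (illustrative sketch).

    descriptors: (n, d) local descriptors (e.g., SIFT) from one image
    centroids:   (k, d) visual-word centres obtained by k-means
    Returns a single L2-normalized k*d vector for the whole image.
    """
    k, d = centroids.shape
    # assign each descriptor to its nearest centroid
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assignments = dists.argmin(axis=1)
    # accumulate residuals (descriptor - centroid) per cluster
    v = np.zeros((k, d))
    for i, a in enumerate(assignments):
        v[a] += descriptors[i] - centroids[a]
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```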
In the last few years, convolutional neural networks (CNNs) [29], [30] have emerged as powerful image representations for various category-level recognition tasks such as object classification [31], [32], [33], [34], scene recognition [35] or object detection [36]. The basic principles of CNNs have been known since the 80s [29], [30], and the recent successes are a combination of advances in GPU-based computation power together with large labelled image datasets [31]. It has been shown that the trained representations are, to some extent, transferable between recognition tasks [32], [36], [37], [38], [39], and direct application of CNN representations trained for object classification [31] as black-box descriptor extractors has brought some improvements in performance on instance-level recognition tasks [40], [41], [42], [43], [44], [45], [46], [47].

In this work we investigate whether the performance can be further improved by CNN representations developed and trained directly for place recognition. This requires addressing the following four main challenges: First, what is a good CNN architecture for place recognition? Second, how to gather a sufficient amount of annotated data for the training? Third, how can we train the developed architecture in an end-to-end manner tailored for the place recognition task? Fourth, how to perform computationally efficient training in order to scale up to very large datasets? To address these challenges, we bring the following four innovations.

First, building on the lessons learnt from the current well performing hand-engineered object retrieval and place recognition pipelines [10], [23], [48], [49], we develop a convolutional neural network architecture for place recognition that aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact fixed length vector representation amenable to efficient indexing. To achieve this, we design a new trainable generalized VLAD layer, NetVLAD, inspired by the Vector of Locally Aggregated Descriptors (VLAD) representation [24] that has shown excellent performance in image retrieval and place recognition. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. The resulting aggregated representation is then compressed using Principal Component Analysis (PCA) to obtain the final compact descriptor of the image.
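To make the NetVLAD layer concrete, here is a minimal PyTorch sketch (not the authors' released code). Following the paper's description, soft cluster assignment is produced by a 1×1 convolution followed by a softmax, residuals against learnable cluster centres are accumulated, and the result is intra-normalized per cluster and then L2-normalized overall; the subsequent PCA compression is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD pooling layer (illustrative sketch)."""

    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1 conv producing per-location soft-assignment scores over clusters
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # learnable cluster centres c_k
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):
        # x: (N, D, H, W) conv5 feature map
        N, D, H, W = x.shape
        a = F.softmax(self.conv(x), dim=1).view(N, self.num_clusters, -1)  # (N, K, HW)
        x = x.view(N, D, -1)                                               # (N, D, HW)
        # weighted sum of residuals (x_i - c_k) over all spatial locations
        vlad = torch.einsum('nkl,ndl->nkd', a, x) \
             - a.sum(dim=-1, keepdim=True) * self.centroids.unsqueeze(0)
        vlad = F.normalize(vlad, p=2, dim=2)   # intra-normalization per cluster
        vlad = vlad.reshape(N, -1)             # flatten to a K*D vector
        return F.normalize(vlad, p=2, dim=1)   # final L2 normalization
```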
Second, to train the architecture for place recognition, we gather a large dataset of multiple panoramic images depicting the same place from different viewpoints over time from the Google Street View Time Machine.
其次,为了训练位置识别的架构,我们从谷歌街景时间机器上收集了一个大的数据集,包含多个全景图像,这些图像从不同的角度描述了同一地点随时间的变化。
Such data is available for vast areas of the world, but provides only weak form of supervision: We know the two panoramas are captured at approximately similar positions based on their (noisy) GPS but we don’t know which parts of the panoramas depict the same parts of the scene.
这样的数据在世界上大部分地区都是可用的,但只提供了微弱的监督形式:我们知道这两幅全景图是基于它们(嘈杂的)GPS在大约相似的位置拍摄的,但我们不知道全景图的哪些部分描述了相同的场景。
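In practice this weak GPS supervision can be turned into training tuples by distance thresholding: images within a small radius of the query are treated as potential positives (at least one of them likely shows the same place), and images beyond a larger radius as definite negatives. A minimal sketch of this step (the radii below are illustrative placeholders, not necessarily the paper's exact values):

```python
import numpy as np

def build_training_tuple(query_gps, db_gps, r_pos=10.0, r_neg=25.0):
    """Split database images into potential positives and definite
    negatives for one query, based only on (noisy) GPS positions.

    query_gps: (2,) query position in local planar metric coordinates
    db_gps:    (N, 2) database positions in the same coordinate frame
    """
    d = np.linalg.norm(db_gps - query_gps, axis=1)   # distance in metres
    potential_positives = np.where(d < r_pos)[0]
    definite_negatives = np.where(d > r_neg)[0]
    return potential_positives, definite_negatives
```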
Third, we create a new loss function, which enables end-to-end learning of the architecture's parameters, tailored for the place recognition task from the weakly labelled Time Machine imagery. The loss function is also more widely applicable to other ranking tasks where large amounts of weakly labelled data are available.
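The key idea can be sketched as a weakly supervised triplet ranking objective: since we only know a set of potential positives, we take the best-matching one (smallest descriptor distance) and require the query to be closer to it than to every definite negative by a margin m, roughly L = Σ_j max(0, min_i d²(q, p_i) + m − d²(q, n_j)). A minimal PyTorch sketch (illustrative, using squared Euclidean distances on L2-normalized descriptors):

```python
import torch

def weak_ranking_loss(q, positives, negatives, margin=0.1):
    """Weakly supervised triplet ranking loss (illustrative sketch).

    q:          (D,)  query descriptor
    positives:  (P, D) descriptors of potential positives
    negatives:  (M, D) descriptors of definite negatives
    """
    d_pos = ((positives - q) ** 2).sum(dim=1).min()   # best potential positive
    d_neg = ((negatives - q) ** 2).sum(dim=1)         # all definite negatives
    # hinge: the positive must beat every negative by the margin
    return torch.clamp(d_pos + margin - d_neg, min=0).sum()
```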
Fourth, we develop an efficient learning procedure which can be applied on very large-scale weakly labelled tasks. It requires only a fraction of the computational time of a naive implementation thanks to improved data efficiency through hard negative mining, combined with an effective use of caching.
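One way to picture the efficiency gain: database descriptors are recomputed only occasionally and kept in a cache, and between recomputations each query is compared against the (slightly stale) cached descriptors to pick out its hardest negatives, instead of running a fresh forward pass over the whole database at every step. A hypothetical sketch of the mining step (the function name, pool sampling and sizes are illustrative, not the paper's exact procedure):

```python
import numpy as np

def mine_hard_negatives(q_desc, cached_neg_descs, num_hard=10, pool_size=1000, rng=None):
    """Pick the hardest negatives for a query from cached descriptors.

    q_desc:           (D,)  query descriptor from the current network
    cached_neg_descs: (N, D) negative descriptors computed a while ago
    Assumes pool_size <= N.
    """
    if rng is None:
        rng = np.random.default_rng()
    # sample a random candidate pool, then keep the closest (hardest) ones
    idx = rng.choice(len(cached_neg_descs), size=pool_size, replace=False)
    d = ((cached_neg_descs[idx] - q_desc) ** 2).sum(axis=1)
    return idx[np.argsort(d)[:num_hard]]
```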
The resulting representation is robust to changes in viewpoint and lighting conditions, while simultaneously learning to focus on the relevant parts of the image such as the building façades and the skyline, and to ignore confusing elements such as cars and people that may occur at many different places (Fig. 1). We show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.

1.1 Related Work
While there have been many improvements in designing better image retrieval [22], [23], [24], [25], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62] and place recognition [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] systems, not many works have performed learning for these tasks. All relevant learning-based approaches fall into one or both of the following two categories: (i) learning for an auxiliary task (e.g., some form of distinctiveness of local features [2], [9], [12], [63], [64], [65], [66]), and (ii) learning on top of shallow hand-engineered descriptors that cannot be fine-tuned for the target task [2], [6], [7], [48], [67]. Both of these are in spirit opposite to the core idea behind deep learning that has provided a major boost in performance in various recognition tasks: end-to-end learning. We will indeed show in Section 6.2 that training representations directly for the end-task, place recognition, is crucial for obtaining good performance.
Numerous works concentrate on learning better local descriptors or metrics to compare them [56], [59], [68], [69], [70], [71], [72], [73], [74], [75], but even though some of them show results on image retrieval, the descriptors are learnt on the task of matching local image patches, and not directly with image retrieval in mind. Some of them also make use of hand-engineered features to bootstrap the learning, i.e., to provide noisy training data [56], [59], [69], [70], [74].
Several works have investigated using CNN-based features for image retrieval. These include treating activations from certain layers directly as descriptors, by concatenating them [43], [76], or by pooling [40], [41], [42], [45], [46], [47]. However, none of these works actually train the CNNs for the task at hand, but use CNNs as black-box descriptor extractors. One exception is the work of Babenko et al. [76] in which the network is fine-tuned on an auxiliary task of classifying 700 landmarks. However, again the network is not trained directly on the target retrieval task. Very recent works [77], [78], published after the first version of this paper [79], train CNNs end-to-end for image retrieval by making use of image correspondences obtained from structure-from-motion models, i.e., they rely on pre-existing image retrieval pipelines based on precise matching of RootSIFT descriptors, spatial verification and bundle adjustment.
Weyand et al. [80] proposed a CNN-based method for geo-localization by partitioning the Earth into cells and treating place recognition as a classification task. While providing impressive rough city/country-level estimates of where a photo is taken, their method is not capable of providing the several-meter-accuracy place recognition that we consider here, as their errors are measured in tens and hundreds of kilometres. Finally, [81] and [82] performed end-to-end learning for the different but related tasks of ground-to-aerial matching [82] and camera pose estimation [81].

Related links:

Scene recognition with NetVLAD (in Chinese)
https://www.jianshu.com/p/7d48bff4d1c3

Paper notes: NetVLAD: CNN architecture for weakly supervised place recognition (in Chinese)
http://www.liuxiao.org/2019/02/%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0%EF%BC%9Anetvlad-cnn-architecture-for-weakly-supervised-place-recognition/

A simple Python implementation of VLAD
https://github.com/Lithogenous/VLAD-SIFT-python/blob/master/vlad_raw.py
