Summary of the Top-5 Solutions in the Google Universal Image Embedding Challenge

Article link: https://blog.youkuaiyun.com/weixin_40779727/article/details/136352790

	Info
Competition	leaderboard: https://www.kaggle.com/competitions/google-universal-image-embedding/leaderboard
Personal blog	http://myhz0606.com/article/guie

Leaderboard ranking

Rank	Score	Paper & code
1	0.728	paper: https://arxiv.org/abs/2210.08473 github: https://github.com/ShihaoShao-GH/1st-Place-Solution-in-Google-Universal-Image-Embedding
2	0.709	github: https://github.com/rainbow-xiao/ECCV2022-ILR-workshop paper: https://arxiv.org/pdf/2210.08735.pdf
3	0.692	Not released
4	0.688	github: https://github.com/IvanAer/G-Universal-CLIP
5	0.688	github: https://github.com/riron1206/kaggle-Google-Universal-Image-Embedding-Competition-5th-Place-Solution paper: https://arxiv.org/abs/2210.09495

TL;DR

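Base model recommendations

The larger and higher-quality the pretraining data, the better the final result. The top-five solutions all used the CLIP ViT-H model as the base model.
Use ArcFace as the optimization objective. (Two FC layers follow the backbone, and the output of the last FC layer is the feature that gets optimized.)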

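Data engineering recommendations

A good dataset has a large impact on the result. The teams agree that Products-10K is the most useful dataset, followed by GLDv2.
Balancing the classes in the training data helps.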

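Training trick recommendations

LP-FT, or alternating training of the head and the backbone, works better than plain fine-tuning. The backbone weights should be updated with a smaller step (that is, a lower learning rate) than the head.
Multi-scale training helps.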
Ensembling and TTA (test-time augmentation) help, at the cost of more compute.

1st method

github: https://github.com/LouieShao/1st-Place-Solution-in-Google-Universal-Image-Embedding

Paper: https://arxiv.org/abs/2210.08473

The author first tried the off-the-shelf approach (extracting features directly from existing pretrained models, with no training):

model	input_size	num_classes	public_score
swin_large_patch4_window12_384_in22k	384	21843	0.405
swin_large_patch4_window7_224_in22k	224	21843	0.400
swinv2_large_window12_192_22k	192	21843	0.392
convnext_xlarge_in22k	224	21843	0.385
swinv2_large_window12to24_192to384_22kft1k	384	1000	0.384
swinv2_large_window12to16_192to256_22kft1k	256	1000	0.382
convnext_xlarge_384_in22ft1k	384	1000	0.382
convnext_large_384_in22ft1k	384	1000	0.375
convnext_large_in22k	224	21843	0.371
convnext_base_384_in22ft1k	384	1000	0.371
beit_large_patch16_224_in22k	224	21843	0.136
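
In the off-the-shelf setting the pretrained network only produces embeddings, and retrieval is a nearest-neighbour search over them. The snippet below is a minimal sketch of that retrieval step. It assumes cosine similarity as the ranking function (the source does not state the similarity measure), and the names `l2_normalize` and `top_k` are hypothetical helpers, not taken from any of the solutions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def top_k(query, index, k=5):
    """Return the ids of the k index embeddings closest to the query.

    Closeness is cosine similarity (an assumption made for this sketch).
    `index` maps an item id to its embedding.
    """
    q = l2_normalize(query)
    scored = []
    for item_id, emb in index.items():
        e = l2_normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, e)), item_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Toy 2-d embeddings; the competition embeddings discussed in this post are 64-d.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k([1.0, 0.05], index, k=2))
```

With real models, `index` would hold the embeddings produced by one of the backbones in the table above.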

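Consensus:

The larger the dataset used for pretraining, the better (consistent with the conclusion I reached earlier).

The author therefore adopted CLIP as the base model (ViT-L, LAION-400M, 31 epochs).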

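Step	Method	Training	Score	Improvement (absolute score gain)
1	ViT-L, LAION-400M, 31 epochs	False	0.499	0
2	+ fine-tune the fc layers on GLDv2 (6 epochs), with ArcFace as the objective (margin=0.5, s=30)	True	0.56	6%
3	+ train on more data (Products-10K [2], Shopee [3], MET Artwork Dataset [4], Alibaba goods [5], H&M Personalized Fashion [6], GPR1200 [7], GLDv2-Full [8], part of the DeepFashion Consumer-to-shop Clothes Retrieval Benchmark)	True	0.61	5%
4	+ alternate training of fc and backbone (first train fc as above, then freeze fc and train only the backbone with the learning rate reduced 10x, for 3 epochs)	True	0.65	4%
5	+ fine-tune on Products-10K with the alternating scheme (the author found this dataset has the largest effect on the result)	True	0.671	2.1%
6	+ ensemble models trained at different resolutions (240, 280)	True	0.680	0.9%
7	+ replace the backbone (switch to ViT-H trained on LAION-2B and repeat steps 2-5)	True	0.703	2.3%
8	+ fine-tune at resolution 280	True	0.705	0.2%
9	+ alternate training of fc and backbone again at 280 (fewer epochs and a lower learning rate)	True	0.723	1.8%
10	+ alternate training of fc and backbone again at 290 (patch overlap of 4 pixels)	True	0.728	0.5%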

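Some of the more important conclusions:

For the base model, the larger the pretraining dataset the better.
For fine-tuning, more data is better and higher-quality data is better.
ArcFace works as the optimization objective.
Alternating training of fc and backbone works better than LP-FT.
Fine-tuning at a higher resolution combined with overlapping patches further improves accuracy (that is, a multi-scale approach).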

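LP-FT (linear probing, then fine-tuning): first train the FC head with the backbone frozen, then fine-tune FC and backbone together, here with different learning rates for the two.
The author's alternating scheme: first freeze the backbone and train the FC layers, then freeze the FC layers and train the backbone.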
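
The alternating scheme can be written down as a list of phases, each of which freezes one part of the model and trains the other. This is only a sketch of the schedule, not the author's training code. The function name `alternating_schedule` and the base learning rate `1e-3` are assumptions made for illustration. The 6 and 3 epochs and the 10x reduction come from rows 2 and 4 of the table above.

```python
def alternating_schedule(base_lr, fc_epochs, backbone_epochs, lr_drop=10.0, rounds=1):
    """Build training phases that alternate between the FC head and the backbone.

    Each phase trains one part while the other stays frozen. The backbone
    phase uses base_lr / lr_drop, so the backbone is updated more gently.
    """
    phases = []
    for _ in range(rounds):
        phases.append({"train": "fc", "frozen": "backbone",
                       "lr": base_lr, "epochs": fc_epochs})
        phases.append({"train": "backbone", "frozen": "fc",
                       "lr": base_lr / lr_drop, "epochs": backbone_epochs})
    return phases

# base_lr=1e-3 is a placeholder value, not a number reported in the source.
schedule = alternating_schedule(base_lr=1e-3, fc_epochs=6, backbone_epochs=3)
for phase in schedule:
    print(phase)
```

In a real training loop each phase would toggle `requires_grad` (or the framework's equivalent) on the frozen part before running its epochs.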

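Other details:

The author used two FC layers. The last one projects to 64 dimensions for retrieval, and the ArcFace loss is computed on this 64-d vector.
A dropout layer sits between the two FC layers, with ratio=0.2.
The optimizer is SGD with momentum, with an L2 weight decay of 1.5e-4.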
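
To make the role of the margin and scale concrete, here is a minimal, framework-free sketch of the ArcFace loss for a single sample, using the values from the table (margin 0.5, s=30) as defaults. It is a simplified illustration and not the author's implementation. In particular it omits the handling that common implementations add for the case where the target angle plus the margin exceeds pi, and the function name `arcface_loss` is hypothetical.

```python
import math

def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """ArcFace loss for one sample.

    The cosine between the normalized embedding and each normalized class
    weight is the logit. The margin is added to the angle of the target
    class only, then all logits are scaled and fed to softmax cross-entropy.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    e = unit(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, unit(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == target:
            cos = math.cos(math.acos(cos) + margin)  # angular margin penalty
        logits.append(scale * cos)

    # Numerically stable softmax cross-entropy.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]

weights = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-class, 2-d example
print(arcface_loss([1.0, 1.0], weights, target=0, margin=0.0))
print(arcface_loss([1.0, 1.0], weights, target=0, margin=0.5))
```

The second call returns a larger loss than the first: with the margin, an embedding that sits exactly between two classes is no longer good enough, which pushes same-class embeddings closer together.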

2nd method

github: https://github.com/rainbow-xiao/ECCV2022-ILR-workshop

paper: https://arxiv.org/pdf/2210.08735.pdf

The 2nd-place model architecture is the same as the 1st-place one:

open_clip_torch: Vit-H14-224-visual as backbone
fc with dropout rate=0.2 as neck
arcface head.

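Some tricks

Data balancing. maximum_image_perclass=100, minimum_image_perclass=3, and classes are resampled to 30.
Dynamic margin. The ArcFace margin starts at init m=0.1, then m = m + 0.1 * (cur_ep - 1), with max_m = 0.8.
Stratified learning rate (stratified lr). The author says this trick gave a very large improvement. The fc is trained with 1e-4 and the backbone with 1e-4 * 0.001. The backbone is trained for only two epochs and then frozen, after which the FC is trained alone for 1 more epoch.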
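
The data-balancing rule above leaves some room for interpretation. The sketch below assumes one reading: classes with fewer than 3 images are dropped, classes with more than 100 are randomly capped at 100, and classes with fewer than 30 are oversampled with replacement up to 30. That reading, the function name `balance_classes` and the fixed seed are all assumptions for illustration. Check the linked repository for the actual procedure.

```python
import random

def balance_classes(images_by_class, min_per_class=3, max_per_class=100,
                    resample_to=30, seed=0):
    """Return a class-balanced copy of {label: [image, ...]}.

    Assumed policy (one interpretation of the rule in the text):
      - fewer than min_per_class images: drop the class
      - more than max_per_class images: keep a random subset of that size
      - fewer than resample_to images: oversample with replacement
    """
    rng = random.Random(seed)
    balanced = {}
    for label, images in images_by_class.items():
        images = list(images)
        if len(images) < min_per_class:
            continue
        if len(images) > max_per_class:
            images = rng.sample(images, max_per_class)
        elif len(images) < resample_to:
            images = images + rng.choices(images, k=resample_to - len(images))
        balanced[label] = images
    return balanced

data = {
    "rare": ["r0", "r1"],
    "small": [f"s{i}" for i in range(5)],
    "mid": [f"m{i}" for i in range(50)],
    "big": [f"b{i}" for i in range(500)],
}
for label, images in balance_classes(data).items():
    print(label, len(images))
```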

Without ensembling or high-resolution fine-tuning, this solution reached a score of 0.713.
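
The dynamic-margin rule from the list of tricks can be sketched as a small function of the epoch. This assumes the formula means a linear ramp from the initial margin, clipped at the maximum, with epochs counted from 1. The source formula could also be read as a cumulative update, so treat this as one plausible interpretation. The name `dynamic_margin` is hypothetical.

```python
def dynamic_margin(epoch, init_m=0.1, step=0.1, max_m=0.8):
    """ArcFace margin for a 1-based epoch: linear ramp, clipped at max_m."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    return min(init_m + step * (epoch - 1), max_m)

# Margin used in each of the first ten epochs under this reading.
print([round(dynamic_margin(ep), 2) for ep in range(1, 11)])
```

Starting with a small margin keeps the early epochs easy to optimize, and the margin then tightens as training progresses.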

4th method

github: https://github.com/IvanAer/G-Universal-CLIP

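The 4th-place solution is very brute-force, and its deployment cost is too high for practical use.

Model architecture

Two CLIP architectures are used: ViT-L-14-336 and ViT-H-14.

Four models are trained with different hyperparameters on the ViT-L-14-336 architecture and five on the ViT-H-14 architecture, for nine models in total. The features are first merged per architecture, then concatenated, and finally reduced to 64 dimensions with PCA.

Training data:

The author used only Google Landmarks 2020 and Products-10k.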

Training pipeline

[Figure: training pipeline]

Inference pipeline

[Figure: inference pipeline]

5th method

github: https://github.com/riron1206/kaggle-Google-Universal-Image-Embedding-Competition-5th-Place-Solution
paper: https://arxiv.org/abs/2210.09495

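Model architecture

backbone: CLIP Vit-H14-224-visual

head: BatchNorm1d(num_features=1024+3, affine=False) -> Linear(in_features=1024+3, out_features=64)

loss: ArcFace(scale=30.0, margin=0.5)

[Figure: model architecture]

Here the author resizes the original image according to its aspect ratio and concatenates it directly with the feature map to capture multi-scale features. This feels inelegant to me, since the two kinds of features are too far apart.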

Training data

Dataset	Classes	Samples
GLDv2	5000	249350
Products10k	9691	141931
GPR1200 landmark	200	2000
GPR1200 face	200	2000
GPR1200 instre	200	2000
GPR1200 sketch	200	2000
GPR1200 sop	200	2000
Food101	101	3030