Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings

"Cross-Modal Retrieval in the Cooking Context" introduces the AdaMine model, which builds a semantic embedding between images and text through joint retrieval and classification losses. The model is validated on the Recipe1M dataset using a double-triplet learning scheme, with ResNet-50 and a bidirectional LSTM learning the image and text features. Experiments show that the adaptive learning strategy improves retrieval performance.


Paper link:

https://arxiv.org/pdf/1804.11146.pdf

Source: ACM SIGIR 2018 (source code not yet released)

 

 

1. Introduction:

What the paper does (recipe retrieval):

Input: an image (or a sentence) plus the dataset      Output: a ranked list of sentences (or images)

In this paper, we propose a cross-modal retrieval model that aligns visual and textual data (such as dish pictures and recipes) in a shared representation space. We describe an effective learning scheme that handles large-scale problems and validate it on the Recipe1M dataset, which contains nearly one million picture-recipe pairs.
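To make the retrieval setup concrete, here is a minimal sketch of how a query is ranked against the dataset once both modalities live in the shared space (PyTorch-style; the cosine-similarity ranking and the function name `rank_recipes` are illustrative assumptions, not code from the paper):

```python
import torch
import torch.nn.functional as F

def rank_recipes(query_image_emb, recipe_embs):
    """Rank every recipe embedding against a single image embedding.

    query_image_emb: (d,) tensor produced by the image encoder (e.g., a ResNet-50 backbone).
    recipe_embs:     (N, d) tensor of recipe embeddings from the text encoder.
    Returns recipe indices sorted from most to least similar.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = F.normalize(query_image_emb, dim=0)
    r = F.normalize(recipe_embs, dim=1)
    scores = r @ q  # (N,) cosine similarities
    return torch.argsort(scores, descending=True)
```

Text-to-image retrieval is the symmetric case: embed the sentence query with the text encoder and rank the image embeddings instead.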

 

2. Contributions:

1. A joint objective function combining a cross-modal retrieval loss and a classification loss is proposed to build the latent space.

2. A double-triplet scheme is proposed to jointly express the two losses (see the sketch after this list):

1) A retrieval loss (e.g., the recipe for a pizza should be closer to its corresponding picture in the latent space than to any other picture; see the blue arrows in Figure 1b).

2) A class loss (e.g., two pizzas should be closer to each other in the latent space than a pizza is to any item from another class, such as a salad). This class-based loss acts directly on the feature space instead of adding a classification layer to the model.

3. A new method is proposed to adapt the gradient updates in the stochastic gradient descent and backpropagation procedure used to train the model (see the sketch in the Model section below).
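A minimal sketch of the double-triplet objective referenced in item 2, written as hinge triplet losses on cosine distance with an image anchor (the margin, the weighting factor `lam`, and all variable names are illustrative assumptions; the full objective is also applied symmetrically with the recipe as anchor):

```python
import torch
import torch.nn.functional as F

def triplet(anchor, pos, neg, margin=0.3):
    """Hinge triplet loss on cosine distance: pull pos toward anchor, push neg away."""
    d_pos = 1.0 - F.cosine_similarity(anchor, pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, neg, dim=-1)
    return F.relu(d_pos - d_neg + margin)

def double_triplet_loss(img, paired_recipe, other_recipe,
                        same_class_item, other_class_item,
                        lam=0.1, margin=0.3):
    """Double triplet with an image embedding as anchor.

    Retrieval term: the paired recipe must be closer than a recipe of another dish.
    Class term:     an item of the same class (e.g., another pizza) must be closer
                    than an item of a different class (e.g., a salad); this acts
                    directly in the feature space, with no classification layer.
    """
    retrieval = triplet(img, paired_recipe, other_recipe, margin)
    semantic = triplet(img, same_class_item, other_class_item, margin)
    return retrieval + lam * semantic
```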

 

 

3. Model:

The model proposed in this paper, AdaMine (ADAptive MINing Embedding), trains the double-triplet objective above with an adaptive mining strategy that focuses the gradient updates on informative triplets.
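A minimal sketch of how such an adaptive strategy can be expressed (this is one reading of "adaptive mining": averaging the batch loss only over triplets that still have non-zero hinge loss, so that already-satisfied triplets do not dilute the gradient; the function name is hypothetical):

```python
import torch

def adaptive_batch_loss(per_triplet_losses):
    """Average hinge losses over informative (non-zero) triplets only.

    per_triplet_losses: (B,) tensor of non-negative triplet losses. Triplets whose
    margin is already satisfied contribute zero loss and zero gradient, so they are
    excluded from the normalization; the remaining hard triplets keep a
    full-strength update instead of being averaged down by the batch size.
    """
    n_informative = (per_triplet_losses > 0).sum().clamp(min=1)
    return per_triplet_losses.sum() / n_informative
```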
