Paper Notes -- An Overview of Cross-Media Retrieval: Concepts, Methodologies, ... - 2018 - (Part II)

This paper comprehensively reviews recent progress in cross-media retrieval, covering concepts, methodologies, evaluation benchmarks, and challenges. It focuses on cross-media retrieval datasets, feature extraction, evaluation metrics, and experimental results, and discusses the prospects of deep neural networks for cross-media retrieval.

Paper information:
Paper: An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges, 2018, Yuxin Peng et al.
Download links for 50+ cross-media related English papers are provided at the end of this post.


These notes consist of two parts: the first covers the concepts and methods of cross-media retrieval (Sections 1.1-1.7), and the second covers the experiments, challenges, and summary (Sections 1.8-1.11). The overall outline is:

(I) Concepts and methods of cross-media retrieval
(II) Experiments, challenges, and summary of cross-media retrieval


An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges (Part II)

Citation

Y. Peng, X. Huang and Y. Zhao, “An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2372-2385, Sept. 2018. doi: 10.1109/TCSVT.2017.2705068

Cross-Media Retrieval Datasets

The paper summarizes the usage frequency of several popular datasets, shown in the figure below. Table 1.4 introduces the main datasets in detail, including how to obtain them.

[Figure 1.2: Usage frequency of popular cross-media datasets (screenshot from the paper)]

Table 1.4: Overview of cross-media datasets

| Dataset | Description | Scale | Pros / Cons |
| --- | --- | --- | --- |
| Wikipedia Dataset | Built from Wikipedia "featured articles" | Categories: 29 (10 selected); image-text pairs: 2,866 | Pros: widely used, an important benchmark for cross-media retrieval. Cons: some categories are hard to distinguish, e.g., warfare and history |
| XMedia Dataset [7][10][11] | The first cross-media dataset containing five media types (text/image/video/audio/3D model) | Categories: 20; 600 instances per category (250 texts, 250 images, 25 videos, 50 audio clips, 25 3D models) | Pros: provides five media types, large scale, clearly separated categories. Cons: not yet widely adopted |
| NUS-WIDE Dataset [101] | Crawled from Flickr by the Lab for Media Search at the National University of Singapore; contains images and their associated tags | Categories: 81; images: 269,648; tags: 5,018 | Pros: widely used, many categories, large amount of data. Cons: only two modalities; not designed specifically for cross-media retrieval |
| Pascal VOC 2007 Dataset (official site) [103] | The Pascal Visual Object Classes (VOC) challenge [103] is a benchmark for visual object category detection and recognition | Categories: 20; images: 9,963 | Pros: popular through the VOC challenge. Cons: not designed specifically for cross-media retrieval |
| Clickture Dataset [104] | A large-scale click-based image dataset collected from one year of click logs of a commercial image search engine | Images: 40 million (full), 1 million (subset); texts: 73.6 million (full), 11.7 million (subset) | Pros: very large scale. Cons: no supervision information (e.g., labels) provided |

Experiments

Feature Extraction and Dataset Splits

Feature extraction
For the Wikipedia, XMedia, and Clickture datasets, the same methods are used to generate text and image representations. For example, texts are represented by the topic histograms of a 10-topic LDA model, and images by bag-of-visual-words (BoVW) histograms over a SIFT codebook with 128 codewords. Each video is first segmented into several shots, and 128-dimensional BoVW histogram features are then extracted from its keyframes. Audio clips are represented by 29-dimensional MFCC features, and 3D models by concatenated 4,700-dimensional vectors of LightField descriptor sets.
For the NUS-WIDE dataset, texts are represented by 1,000-dimensional word-frequency features and images by 500-dimensional BoVW features, and so on for the remaining datasets.
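As an illustration of the BoVW image features described above, here is a minimal sketch assuming OpenCV (with SIFT support) and scikit-learn are available; the 128-codeword size follows the text, while the file paths, function names, and clustering details are illustrative choices rather than the paper's exact pipeline.

```python
# Minimal sketch of bag-of-visual-words (BoVW) image features.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    """Extract local SIFT descriptors (each 128-D) from one image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_codebook(image_paths, n_codewords=128):
    """Cluster all local descriptors into a visual-word codebook (128 codewords as in the text)."""
    all_desc = np.vstack([sift_descriptors(p) for p in image_paths])
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(image_path, codebook):
    """Represent one image as a normalized histogram over the codewords."""
    desc = sift_descriptors(image_path)
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```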
Dataset splits

Table 1.5: Dataset splits

| | Wikipedia | XMedia | NUS-WIDE | Pascal VOC 2007 |
| --- | --- | --- | --- | --- |
| Train | 2,173 | 9,600 | 58,620 | 5,011 / 2,808 (selected) |
| Test | 693 | 2,400 | 38,955 | 4,952 / 2,841 |
| Ratio | 3:1 | 4:1 | 3:2 | 1:1 |

Evaluation Metrics and Compared Methods

Two retrieval tasks are considered, each with its corresponding evaluation setting.

  1. Multi-modality cross-media retrieval: given a query example of any media type, results of all media types are retrieved.
  2. Bi-modality cross-media retrieval: given a query example of any media type, results of the other media type are retrieved.
    Retrieval results are evaluated with the MAP (mean average precision) score. MAP is widely adopted in information retrieval; a minimal implementation sketch is shown below, and detailed introductions can be found at:
    http://blog.sina.com.cn/s/blog_662234020100pozd.html
    https://blog.youkuaiyun.com/sunshine__0411/article/details/83501942

Experimental Results

The methods compared in the paper are:
BITR [20], CCA [18], CCA+SMN [27], CFA [30], CMCP [6], DCMIT [35], HSNN [5], JGRHML [7], JRL [10], LGCFL [85], ml-CCA [26], mv-CCA [25], and S2UPG [11].
For the multi-modality retrieval results, the best score in each setting is marked in orange; in most cases the S2UPG method performs best.
[Figure 1.3: Multi-modality retrieval results]

The bi-modality retrieval results show the same pattern: S2UPG again performs best.
The paper's explanation is: "S2UPG achieves the best results because it adopts the media patches to model fine-grained correlations, and the unified hypergraph can jointly model data from all media types, so as to fully exploit the correlations among them." (S2UPG is another paper by the same authors [11].)
[Figure 1.4: Bi-modality retrieval results]

Comparison Experiments

The paper runs these experiments on the Wikipedia, XMedia, and Clickture datasets, using BoW features for texts and CNN features for images and videos. Figure 1.5 shows the average of all MAP scores over the multi-modality and bi-modality cross-media retrieval tasks; detailed results and results on the other datasets are available on the project website.
The results show that CNN features improve the performance of most methods, while performance with the BoW features is less stable.
[Figure 1.5: Results of the comparison experiments]
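For reference, below is a minimal sketch of the kind of CNN image features used in this comparison, assuming a torchvision pretrained backbone; the notes here do not specify which network or layer the paper uses, so VGG-19 fc7 (4096-D) is only an illustrative choice, not the paper's exact setup.

```python
# Minimal sketch: extract a 4096-D CNN feature vector from a pretrained VGG-19 (fc7).
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
# Keep the classifier only up to fc7 (drop the final 1000-way classification layer).
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def cnn_feature(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return vgg(x).squeeze(0)   # 4096-D feature vector
```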

Challenges and Open Issues

Dataset Construction and Benchmarking

To construct a high-quality dataset, the following questions should be considered:

  1. Which classes should be included in the dataset?
  2. How many media types should be involved?
  3. How large should the dataset be?
    To address these questions, the authors' team has constructed a new dataset, XMediaNet, available at http://www.icst.pku.edu.cn/mipl/XMedia. According to the authors, it has 200 categories and will contain more than 100,000 instances.

Improving Accuracy and Efficiency

To improve accuracy, graph-based methods for cross-media similarity measurement could exploit more contextual information, such as link relations, when constructing the graph. The discriminative power of each single media type also matters: retrieval accuracy increases when more discriminative features, such as CNN features, are adopted.
As for efficiency, although some hashing methods are relatively efficient, efficiency has not yet received enough attention. As datasets grow larger, it will become more convenient for researchers to evaluate the efficiency of their methods.

Applications of Deep Neural Networks

DNNs are designed to mimic the neuron structures of the human brain and can naturally model correlations across different media types, so it is worth exploring DNNs to bridge the "media gap". As in single-media retrieval, the application of DNNs remains a research hotspot in cross-media retrieval. On the one hand, existing methods mostly take single-media features as input, so they depend heavily on the effectiveness of those features. Future work could aim at end-to-end architectures for cross-media retrieval that take raw media instances as input (e.g., raw images and audio clips) and produce retrieval results directly through DNNs. Special networks for specific media types (e.g., R-CNN [58] for object region detection) could also be incorporated into a unified cross-media retrieval framework. On the other hand, most existing methods are designed for only two media types. In future work, researchers could focus on jointly analyzing more than two media types, which would make DNNs more flexible and effective for cross-media retrieval.
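To make the "common subspace via DNN" idea above concrete, here is a minimal two-branch sketch that takes precomputed image and text features as input (i.e., not the end-to-end setting discussed above); the layer sizes, names, and the contrastive-style pairing loss are illustrative assumptions, not any specific method from the paper.

```python
# Minimal sketch of a two-branch network projecting two media types into a common subspace.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceNet(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=1000, common_dim=256):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(),
                                        nn.Linear(1024, common_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(),
                                        nn.Linear(1024, common_dim))

    def forward(self, img_feat, txt_feat):
        # Project both modalities into the same space and L2-normalize,
        # so cosine similarity can be used for cross-media retrieval.
        img_emb = F.normalize(self.img_branch(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_branch(txt_feat), dim=-1)
        return img_emb, txt_emb

def pairwise_loss(img_emb, txt_emb, temperature=0.07):
    """Pull matched image-text pairs together (InfoNCE-style, illustrative only)."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```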

Exploiting Contextual Correlation Information

The main challenge of cross-media retrieval remains the heterogeneous forms of different media types. Cross-media correlations are often related to contextual information; for example, if an image and an audio clip come from two web pages with a link relationship, they are likely relevant to each other. Many existing methods (e.g., CCA, CFA, and JRL) take only co-existence relationships and semantic category labels as training information while ignoring the rich contextual information. More attention should therefore be paid to contextual information to improve cross-media retrieval performance.

Practical Applications of Cross-Media Retrieval

As the accuracy and efficiency of cross-media retrieval keep improving, its application scenarios will become broader. Users will be able to retrieve from large-scale cross-media data more conveniently and will expect cross-media search engines that can retrieve results of various media types, such as text, image, video, audio, and 3D model, from a query of any single media type. Other possible application scenarios include enterprises that deal with cross-media data, such as TV stations, media companies, digital libraries, and publishing companies. Both the Internet and related enterprises will have a huge demand for cross-media retrieval.

Summary

Cross-media retrieval is an important research topic that aims to deal with the "media gap" when performing retrieval across different media types. The paper reviews more than 100 references to give an overview of cross-media retrieval, establish evaluation benchmarks, and facilitate related research. Existing methods are introduced, mainly including common space learning and cross-media similarity measurement methods: common space learning methods explicitly learn a common space for different media types in which retrieval is performed, while cross-media similarity measurement methods measure cross-media similarity directly without a common space. Widely used cross-media retrieval datasets are also introduced, including the Wikipedia, XMedia, NUS-WIDE, Pascal VOC 2007, and Clickture datasets. Among them, XMedia, constructed by the authors, is the first dataset with five media types, and they are further constructing a new dataset, XMediaNet, with five media types and more than 100,000 instances. In addition, cross-media benchmarks such as datasets, compared methods, evaluation metrics, and experimental results are given, and a continuously updated website is maintained to present them. Finally, the main challenges and open issues for future work are presented. The authors hope these can attract more researchers to focus on cross-media retrieval and promote related research and applications.

Resources

50+ cross-media related English papers

Download link

Appendix: References of the paper

[1] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 2, no. 1, pp. 1–19, 2006.
[2] S. Clinchant, J. Ah-Pine, and G. Csurka, “Semantic combination of textual and visual information in multimedia retrieval,” in Proc. ACM Int. Conf. Multimedia Retr. (ICMR), 2011, p. 44.
[3] Y. Liu, W.-L. Zhao, C.-W. Ngo, C.-S. Xu, and H.-Q. Lu, “Coherent bagof audio words model for efficient large-scale video copy detection,” in Proc. ACM Int. Conf. Image Video Retr. (CIVR), 2010, pp. 89–96.
[4] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, “Ranking with local regression and global alignment for cross media retrieval,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2009, pp. 175–184.
[5] X. Zhai, Y. Peng, and J. Xiao, “Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval,” in Proc. Int. Conf. MultiMedia Modeling (MMM), 2012, pp. 312–322.
[6] X. Zhai, Y. Peng, and J. Xiao, “Cross-modality correlation propagation for cross-media retrieval,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2012, pp. 2337–2340.
[7] X. Zhai, Y. Peng, and J. Xiao, “Heterogeneous metric learning with joint graph regularization for cross-media retrieval,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2013, pp. 1198–1204.
[8] D. Ma, X. Zhai, and Y. Peng, “Cross-media retrieval by cluster-based correlation analysis,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2013, pp. 3986–3990.
[9] X. Zhai, Y. Peng, and J. Xiao, “Cross-media retrieval by intra-media and inter-media correlation mining,” Multimedia Syst., vol. 19, no. 5, pp. 395–406, Oct. 2013.
[10] X. Zhai, Y. Peng, and J. Xiao, “Learning cross-media joint representation with sparse and semisupervised regularization,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 6, pp. 965–978, Jun. 2014.
[11] Y. Peng, X. Zhai, Y. Zhao, and X. Huang, “Semi-supervised crossmedia feature learning with unified patch graph regularization,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 3, pp. 583–596, Mar. 2016.
[12] Y. Peng, X. Huang, and J. Qi, “Cross-media shared representation by hierarchical learning with multiple deep networks,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2016, pp. 3846–3853.
[13] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2003, pp. 119–126.
[14] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. (2014). “Deep captioning with multimodal recurrent neural networks (m-RNN).” [Online]. Available: https://arxiv.org/abs/1412.6632
[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3156–3164.
[16] J. Tang, X. Shu, Z. Li, G.-J. Qi, and J. Wang, “Generalized deep transfer networks for knowledge propagation in heterogeneous domains,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 12, no. 4s, pp. 68:1–68:22, 2016.
[17] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[18] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, nos. 3–4, pp. 321–377, 1936.
[19] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[20] Y. Verma and C. V. Jawahar, “Im2Text and Text2Im: Associating images and texts for cross-modal retrieval,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2014, pp. 1–13.
[21] B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Associating neural word embeddings with deep image representations using fisher vectors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4437–4446.
[22] S. Akaho. (2006). “A kernel method for canonical correlation analysis.” [Online]. Available: https://arxiv.org/abs/cs/0609071
[23] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2014, pp. 823–831.
[24] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 3408–3415.
[25] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling Internet images, tags, and their semantics,” Int. J. Comput. Vis., vol. 106, no. 2, pp. 210–233, Jan. 2014.
[26] V. Ranjan, N. Rasiwasia, and C. V. Jawahar, “Multi-label cross-modal retrieval,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015,pp. 4094–4102.
[27] N. Rasiwasia et al., “A new approach to cross-modal multimedia retrieval,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2010, pp. 251–260.
[28] J. C. Pereira et al., “On the role of correlation and abstraction in crossmodal multimedia retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 521–535, Mar. 2014.
[29] A. Sharma, A. Kumar, H. Daume, III, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2160–2167.
[30] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through cross-modal association,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2003, pp. 604–611.
[31] A. Frome et al., “DeViSE: A deep visual-semantic embedding model,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2013, pp. 2121–2129.
[32] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in Proc. Int. Conf. Mach. Learn. (ICML), 2014, pp. 595–603.
[33] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2011, pp. 689–696.
[34] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 2222–2230.
[35] F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3441–3450.
[36] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 7–16.
[37] H. Zhang, Y. Yang, H. Luan, S. Yang, and T.-S. Chua, “Start from scratch: Towards automatically identifying, modeling, and naming visual attributes,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 187–196.
[38] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep multiview representation learning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 1083–1092.
[39] F. Wu et al., “Learning of multimodal representations with random walks on the click graph,” IEEE Trans. Image Process., vol. 25, no. 2, pp. 630–642, Feb. 2016.
[40] Y. Wei et al., “Cross-modal retrieval with CNN visual features: A new baseline,” IEEE Trans. Cybern., vol. 47, no. 2, pp. 449–460, Feb. 2017.
[41] Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan, “Cross-modal retrieval via deep and bidirectional representation learning,” IEEE Trans. Multimedia, vol. 18, no. 7, pp. 1363–1377, Jul. 2016.
[42] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, “Effective multi-modal retrieval based on stacked auto-encoders,” in Proc. Int. Conf. Very Large Data Bases (VLDB), 2014, pp. 649–660.
[43] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned cross-modal representations from weakly aligned data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2940–2949.
[44] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 1060–1069.
[45] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 217–225.
[46] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672–2680.
[47] I. Goodfellow. (2017). “NIPS 2016 tutorial: Generative adversarial networks.” [Online]. Available: https://arxiv.org/abs/1701.00160
[48] M. Belkin, I. Matveeva, and P. Niyogi, “Regularization and semisupervised learning on large graphs,” Learning Theory. Berlin, Germany: Springer, 2004, pp. 624–638.
[49] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Trans. Multimedia, vol. 16, no. 2, pp. 427–439, Feb. 2014.
[50] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2010–2023, Oct. 2015.
[51] J. Liang, Z. Li, D. Cao, R. He, and J. Wang, “Self-paced cross-modal subspace matching,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2016, pp. 569–578.
[52] N. Quadrianto and C. H. Lamppert, “Learning multi-view neighborhood preserving projections,” in Proc. Int. Conf. Mach. Learn. (ICML), 2011, pp. 425–432.
[53] B. McFee and G. Lanckriet, “Metric learning to rank,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 775–782.
[54] W. Wu, J. Xu, and H. Li, “Learning similarity function between objects in heterogeneous spaces,” Microsoft Res., Redmond, WA, USA, Tech. Rep. MSR-TR-2010-86, 2010.
[55] B. Bai et al., “Learning to rank with (a lot of) word features,” Inf. Retr., vol. 13, no. 3, pp. 291–314, Jun. 2010.
[56] D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1371–1384, Aug. 2008.
[57] F. Wu, X. Lu, Z. Zhang, S. Yan, Y. Rui, and Y. Zhuang, “Cross-media semantic representation via bi-directional learning to rank,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2013, pp. 877–886.
[58] X. Jiang et al., “Deep compositional cross-modal learning to rank via local-global alignment,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2015, pp. 69–78.
[59] F. Wu et al., “Cross-modal learning to rank via latent joint representation,” IEEE Trans. Image Process., vol. 24, no. 5, pp. 1497–1509, May 2015.
[60] G. Monaci, P. Jost, P. Vandergheynst, B. Mailhe, S. Lesage, and R. Gribonval, “Learning multimodal dictionaries,” IEEE Trans. Image Process., vol. 16, no. 9, pp. 2272–2283, Sep. 2007.
[61] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces with structured sparsity,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2010, pp. 982–990.
[62] F. Zhu, L. Shao, and M. Yu, “Cross-modality submodular dictionary learning for information retrieval,” in Proc. ACM Int. Conf. Conf. Inf. Knowl. Manage. (CIKM), 2014, pp. 1479–1488.
[63] S. Wang, L. Zhang, Y. Liang, and Q. Pan, “Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2216–2223.
[64] Y. Zhuang, Y. Wang, F. Wu, Y. Zhang, and W. Lu, “Supervised coupled dictionary learning with group structures for multi-modal retrieval,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2013, pp. 1070–1076.
[65] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, “Learning to hash for indexing big data—A survey,” Proc. IEEE, vol. 104, no. 1, pp. 34–57, Jan. 2016.
[66] J. Tang, Z. Li, M. Wang, and R. Zhao, “Neighborhood discriminant hashing for large-scale image retrieval,” IEEE Trans. Image Process., vol. 24, no. 9, pp. 2827–2840, Sep. 2015.
[67] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2011, pp. 1360–1365.
[68] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2011, pp. 225–234.
[69] Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1376–1384.
[70] Y. Hu, Z. Jin, H. Ren, D. Cai, and X. He, “Iterative multiview hashing for cross media indexing,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 527–536.
[71] Y. Zhen and D.-Y. Yeung, “A probabilistic model for multimodal hash function learning,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (SIGKDD), 2012, pp. 940–948.
[72] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2014, pp. 395–404.
[73] M. Long, Y. Cao, J. Wang, and P. S. Yu, “Composite correlation quantization for efficient multimodal retrieval,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2016, pp. 579–588.
[74] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2010, pp. 3594–3601.
[75] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dual-view hashing,” in Proc. Int. Conf. Mach. Learn. (ICML), 2013, pp. 1328–1336.
[76] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 2083–2090.
[77] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao, “Parametric local multimodal hashing for cross-view similarity search,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2013, pp. 2754–2760.
[78] Y. Zhuang, Z. Yu, W. Wang, F. Wu, S. Tang, and J. Shao, “Crossmedia hashing with neural networks,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 901–904.
[79] Q. Wang, L. Si, and B. Shen, “Learning to hash on partial multimodal data,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2015, pp. 3904–3910.
[80] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2015, pp. 3890–3896.
[81] H. Liu, R. Ji, Y. Wu, and G. Hua, “Supervised matrix factorization for cross-modality hashing,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2016, pp. 1767–1773.
[82] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal hashing with orthogonal regularization,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2015, pp. 2291–2297.
[83] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2014, pp. 415–424.
[84] L. Zhang, Y. Zhao, Z. Zhu, S. Wei, and X. Wu, “Mining semantically consistent patterns for cross-view data,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 11, pp. 2745–2758, Nov. 2014.
[85] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Trans. Multimedia, vol. 17, no. 3, pp. 370–381, Mar. 2015.
[86] Y. Hua, S. Wang, S. Liu, Q. Huang, and A. Cai, “TINA: Cross-modal correlation learning by adaptive hierarchical semantic aggregation,” in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2014, pp. 190–199.
[87] Y. Wei et al., “Modality-dependent cross-media retrieval,” ACM Trans. Intell. Syst. Technol., vol. 7, no. 4, pp. 57:1–57:13, 2016.
[88] J. H. Ham, D. D. Lee, and L. K. Saul, “Semisupervised alignment of manifolds,” in Proc. Int. Conf. Uncertainty Artif. Intell. (UAI), vol. 10. 2005, pp. 120–127.
[89] X. Mao, B. Lin, D. Cai, X. He, and J. Pei, “Parallel field alignment for cross media retrieval,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2013, pp. 897–906.
[90] Y.-T. Zhuang, Y. Yang, and F. Wu, “Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval,” IEEE Trans. Multimedia, vol. 10, no. 2, pp. 221–229, Feb. 2008.
[91] H. Tong, J. He, M. Li, C. Zhang, and W.-Y. Ma, “Graph based multimodality learning,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2005, pp. 862–871.
[92] Y. Yang, F. Wu, D. Xu, Y. Zhuang, and L.-T. Chia, “Cross-media retrieval using query dependent search methods,” Pattern Recognit., vol. 43, no. 8, pp. 2927–2936, Aug. 2010.
[93] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, “A multimedia retrieval framework based on semi-supervised ranking and relevance feedback,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, Apr. 2012.
[94] Y. Zhuang, H. Shan, and F. Wu, “An approach for cross-media retrieval with cross-reference graph and pagerank,” in Proc. IEEE Int. Conf. Multi-Media Modelling (MMM), Jan. 2006, pp. 161–168.
[95] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval,” IEEE Trans. Multimedia, vol. 10, no. 3, pp. 437–446, Apr. 2008.
[96] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[97] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2003, pp. 127–134.
[98] D. Putthividhy, H. T. Attias, and S. S. Nagarajan, “Topic regression multi-modal latent Dirichlet allocation for image annotation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3408–3415.
[99] Y. Jia, M. Salzmann, and T. Darrell, “Learning cross-modality similarity for multinomial data,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2407–2414.
[100] Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, “Multi-modal mutual topic reinforce modeling for cross-media retrieval,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 307–316.
[101] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nuswide: A real-world Web image database from national University of Singapore,” in Proc. ACM Int. Conf. Image Video Retr. (CIVR), 2009, p. 48.
[102] G. A. Miller, “WordNet: A lexical database for English,” Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
[103] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[104] X. Hua et al., “Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2013, pp. 243–252.
[105] Y. Pan, T. Yao, X. Tian, H. Li, and C. Ngo, “Click-throughbased subspace learning for image search,” in Proc. ACM Int. Conf. Multimedia (ACM MM), 2014, pp. 233–236.
[106] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Dept. Comput. Sci., Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. 07-49, 2007.
[107] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.

Adversarial Cross-Media Retrieval

Abstract of the ACMR paper (Adversarial Cross-Modal Retrieval): "Cross-modal retrieval aims to enable flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representation and that it significantly outperforms the state-of-the-art cross-modal retrieval methods."
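To make the adversarial interplay described in this abstract concrete, here is a minimal sketch: a feature projector per modality maps features into a common subspace, and a modality classifier tries to tell image embeddings from text embeddings while the projectors are trained to align matched pairs and confuse the classifier. The architectures, dimensions, and simplified losses (the paper's triplet constraints are replaced here by a plain pair-alignment term) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of adversarial common-subspace learning in the spirit of ACMR.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps image or text features into a shared subspace."""
    def __init__(self, in_dim, common_dim=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.Tanh(),
                                 nn.Linear(512, common_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

img_proj, txt_proj = Projector(4096), Projector(1000)
modality_clf = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 2))

opt_proj = torch.optim.Adam(list(img_proj.parameters()) + list(txt_proj.parameters()), lr=1e-4)
opt_clf = torch.optim.Adam(modality_clf.parameters(), lr=1e-4)

def train_step(img_feat, txt_feat):
    img_emb, txt_emb = img_proj(img_feat), txt_proj(txt_feat)
    emb = torch.cat([img_emb, txt_emb], dim=0)
    modality = torch.cat([torch.zeros(len(img_emb)), torch.ones(len(txt_emb))]).long()

    # 1) Modality classifier: learn to discriminate image vs. text embeddings.
    clf_loss = F.cross_entropy(modality_clf(emb.detach()), modality)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()

    # 2) Projectors: align matched image-text pairs and confuse the classifier
    #    (flipped modality labels serve as a simple adversarial stand-in).
    align_loss = (1 - (img_emb * txt_emb).sum(dim=-1)).mean()
    adv_loss = F.cross_entropy(modality_clf(emb), 1 - modality)
    proj_loss = align_loss + 0.1 * adv_loss
    opt_proj.zero_grad()
    proj_loss.backward()
    opt_proj.step()
    return clf_loss.item(), proj_loss.item()
```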