Deep compositional cross-modal learning to rank via local-global alignment

Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, Yueting Zhuang
2015-10-13
Abstract:Cross-modal retrieval is a very hot research topic that is imperative to many applications involving multi-modal data. Discovering an appropriate representation for multi-modal data and learning a ranking function are essential to boost the cross-media retrieval. Motivated by the assumption that a compositional cross-modal semantic representation (pairs of images and text) is more attractive for cross-modal ranking, this paper exploits the existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (C2MLR). In this paper, C2MLR considers learning a multi-modal embedding from the perspective of optimizing a pairwise ranking problem while enhancing both local alignment and global alignment. In particular, the local alignment (i.e., the alignment of visual objects and textual words) and the global alignment (i.e., the image-level and …
What problem does this paper attempt to address?