Multi-Modal Retrieval Via Deep Textual-Visual Correlation Learning

Jun Song, Yueyang Wang, Fei Wu, Weiming Lu, Siliang Tang, Yueting Zhuang
DOI: https://doi.org/10.1007/978-3-319-23989-7_19
2015-01-01
Abstract: In this paper, we consider multi-modal retrieval from the perspective of deep textual-visual learning, with the aim of preserving the correlations between multi-modal data. More specifically, we propose a general multi-modal retrieval algorithm that maximizes the canonical correlations between multi-modal data via deep learning, which we call Deep Textual-Visual correlation learning (DTV). Given pairs of images and their describing documents, DTV uses a convolutional neural network to learn visual representations of the images and a dependency-tree recursive neural network (DT-RNN) to learn compositional textual representations of the documents. DTV then projects the visual and textual representations into a common embedding space by matrix-vector canonical correlation analysis (CCA), so that each pair of multi-modal data is maximally correlated subject to being uncorrelated with other pairs. The experimental results indicate the effectiveness of the proposed DTV when applied to multi-modal retrieval.
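The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain linear CCA rather than the matrix-vector CCA named in the abstract, and it replaces the CNN and DT-RNN outputs with synthetic feature vectors that share a hidden latent factor. The names `fit_cca`, `embed`, `retrieve`, `img_feats`, and `txt_feats`, as well as the regularization constant, are assumptions made for illustration only.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Plain linear CCA on paired features X (n x dx) and Y (n x dy).

    Returns the two means, the two projection matrices, and the top-k
    canonical correlations. `reg` is an assumed ridge term for stability.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    Wx = Sxx_is @ U[:, :k]
    Wy = Syy_is @ Vt.T[:, :k]
    return mx, my, Wx, Wy, s[:k]

def embed(F, mean, W):
    """Project features into the common space and L2-normalize rows."""
    E = (F - mean) @ W
    return E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)

def retrieve(query_emb, gallery_emb):
    """Rank gallery items for each query by cosine similarity (best first)."""
    sims = query_emb @ gallery_emb.T
    return np.argsort(-sims, axis=1)

# Synthetic stand-ins for the learned visual and textual representations:
# both views are noisy linear functions of the same hidden factor z.
rng = np.random.default_rng(0)
n, d_latent = 200, 4
z = rng.normal(size=(n, d_latent))
img_feats = z @ rng.normal(size=(d_latent, 10)) + 0.01 * rng.normal(size=(n, 10))
txt_feats = z @ rng.normal(size=(d_latent, 8)) + 0.01 * rng.normal(size=(n, 8))

mx, my, Wx, Wy, corrs = fit_cca(img_feats, txt_feats, k=d_latent)
img_emb = embed(img_feats, mx, Wx)
txt_emb = embed(txt_feats, my, Wy)

# Text-to-image retrieval: fraction of documents whose paired image ranks first.
ranking = retrieve(txt_emb, img_emb)
top1 = float(np.mean(ranking[:, 0] == np.arange(n)))
print("canonical correlations:", np.round(corrs, 3))
print("top-1 text-to-image accuracy:", top1)
```

In this sketch the two views are embedded so that paired items are highly correlated in the common space, and retrieval reduces to a nearest-neighbor search by cosine similarity. In a deep variant such as the one the abstract describes, the input features would instead come from the trained image and text networks.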