3View deep canonical correlation analysis for cross-modal retrieval

Jie Shao, Zhi-Cheng Zhao, Fei Su, Ting Yue
DOI: https://doi.org/10.1109/VCIP.2015.7457870
2015-01-01
Abstract: This paper investigates the problem of modeling Internet images and their associated text for cross-modal retrieval tasks such as text-to-image and image-to-text search. Canonical correlation analysis (CCA), a classic two-view approach for mapping text and images into a common latent space, does not make use of the semantic information of text-image pairs. We use CCA to map text, image, and semantic information into a common latent space in which the correlation of the three views is maximized. To improve the performance of CCA, we propose 3view-Deep Canonical Correlation Analysis (3view-DCCA), a nonlinear extension of CCA that learns the complex nonlinear transformations between the three views. Like most deep learning methods, DCCA is prone to overfitting. To overcome this, we add the reconstruction loss of each view to the loss function, which also includes the correlation loss of every pair of views and a regularization term on the parameters. Inspired by PageRank, we propose a search-based similarity method to score relevance. The proposed model (3view-DCCA) is evaluated on three publicly available real-world data sets. We demonstrate that our deep model performs significantly better than traditional CCA-based models and several other deep learning models on cross-modal retrieval tasks.
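The loss described in the abstract (pairwise correlation between views, plus per-view reconstruction loss, plus parameter regularization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact form of the objective, so the function names, the weights `lam_rec` and `lam_reg`, the covariance regularizer `reg`, and the use of mean squared error for reconstruction are all assumptions made for illustration. The projected views are taken as plain arrays, standing in for the outputs of the three networks.

```python
import numpy as np

def _inv_sqrt(mat):
    # Inverse square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def total_correlation(a, b, reg=1e-4):
    """Sum of canonical correlations between two views (rows are samples)."""
    n = a.shape[0]
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Regularized covariance estimates (reg is an assumed stabilizer).
    saa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    sbb = b.T @ b / (n - 1) + reg * np.eye(b.shape[1])
    sab = a.T @ b / (n - 1)
    t = _inv_sqrt(saa) @ sab @ _inv_sqrt(sbb)
    # The singular values of t are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False).sum()

def three_view_loss(views, recons, inputs, weights, lam_rec=0.1, lam_reg=1e-3):
    """Hypothetical combined objective: lower is better.

    views   -- projected (text, image, semantic) arrays
    recons  -- reconstructions of each raw input
    inputs  -- the raw inputs being reconstructed
    weights -- parameter arrays to penalize with an L2 term
    """
    text, image, sem = views
    corr = (total_correlation(text, image)
            + total_correlation(text, sem)
            + total_correlation(image, sem))
    rec = sum(np.mean((r - x) ** 2) for r, x in zip(recons, inputs))
    l2 = sum(np.sum(w ** 2) for w in weights)
    # Maximizing correlation is expressed as minimizing its negative.
    return -corr + lam_rec * rec + lam_reg * l2
```

In a deep variant, `views` would be produced by three trainable networks and the loss would be minimized by gradient descent in an automatic differentiation framework; the NumPy version above only evaluates the objective for fixed arrays.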