CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network.

Yuxin Peng,Jinwei Qi,Xin Huang,Yuxin Yuan
DOI: https://doi.org/10.1109/TMM.2017.2742704
2018-01-01
Abstract:Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural network (DNN): The first learning stage is to generate separate representation for each modality and the second learning stage is to get the cross-modal common representation. However the existing methods have three limitations: 1) In the first learning stage they only model intramodality correlation but ignore intermodality correlation with rich complementary context. 2) In the second learning stage they only adopt shallow networks with single-loss regularization but ignore the intrinsic relevance of intramodality and intermodality correlation. 3) Only original instances are considered while the complementary fine-grained clues provided by their patches are ignored. For addressing the above problems this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network and the contributions are as follows: 1) In the first learning stage CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling which fuses the coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Comparing with 13 state-of-the-art methods on 6 widely-used cross-modal datasets the experimental results show our CCL approach achieves the best performance.
What problem does this paper attempt to address?