Abstract: The heterogeneity gap leads to inconsistent distributions and representations between image and text, which makes it challenging to measure their similarities and construct cross-media correlation between them. Existing works mainly model the cross-media correlation in a common subspace, which leads to insufficient correlation modeling because such a third-party subspace relies on an intermediate unidirectional transformation. Neural machine translation aims to establish a correspondence between two entirely different languages. Inspired by its recent advances, we observe that cross-media correlation learning shares a striking characteristic with it: image and text can be considered bilingual pairs, where the image is treated as a special kind of language that provides visual description. Bidirectional transformation can then be conducted between image and text to effectively explore cross-media correlation in the feature space of each media type. Thus, we propose a reinforced cross-media bidirectional translation (RCBT) approach to model the correlation between visual and textual descriptions. First, a cross-media bidirectional translation mechanism is proposed to conduct direct transformation between the bilingual pairs of visual and textual descriptions in both directions, so that the cross-media correlation can be effectively captured in the feature spaces of both image and text through bidirectional translation training. Second, a cross-media context-aware network with residual attention is proposed to exploit the rich spatial and temporal context hints with a cross-media convolutional recurrent neural network, which can lead to more precise correlation learning and thereby promote the bidirectional translation process. Third, cross-media reinforcement learning is proposed to perform a two-agent communication game, played as a round between image and text, to boost the bidirectional translation process; we further extract inter-media and intra-media reward signals to provide complementary clues for learning cross-media correlation. Extensive experiments on cross-media retrieval verify the effectiveness of the proposed RCBT approach in comparison with 11 state-of-the-art methods on three cross-media datasets.
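The round between image and text described in the abstract can be illustrated with a toy sketch. It is not the RCBT implementation: the linear maps `W_i2t` and `W_t2i` stand in for the learned translators, and the two cosine-based rewards are hypothetical stand-ins for the inter-media and intra-media reward signals, whose actual definitions the abstract does not give. All function names are assumptions made for illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def round_rewards(img, txt, W_i2t, W_t2i):
    """One image -> text -> image round with two hypothetical rewards.

    inter: similarity of the translated image to its paired text,
           measured in the text feature space.
    intra: similarity of the back-translated feature to the original
           image, measured in the image feature space.
    """
    txt_hat = matvec(W_i2t, img)       # image translated into text space
    img_back = matvec(W_t2i, txt_hat)  # translated back into image space
    return cosine(txt_hat, txt), cosine(img_back, img)

def retrieve_texts(img, texts, W_i2t):
    """Rank text indices by similarity to the translated image query."""
    query = matvec(W_i2t, img)
    return sorted(range(len(texts)),
                  key=lambda j: cosine(query, texts[j]),
                  reverse=True)

# Toy 2-D features; the translators simply swap the two coordinates.
W_i2t = [[0.0, 1.0], [1.0, 0.0]]
W_t2i = [[0.0, 1.0], [1.0, 0.0]]
image = [1.0, 0.0]
texts = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

inter, intra = round_rewards(image, texts[1], W_i2t, W_t2i)
ranking = retrieve_texts(image, texts, W_i2t)
```

In this sketch the image query is translated directly into the text feature space and compared there, rather than both media being projected into a third shared subspace. The text-to-image direction would mirror it with the roles of the two maps exchanged.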

Progressive Cross-Media Correlation Learning

Image Retrieval by Cross-Media Relevance Fusion

Attention-Sharing Correlation Learning For Cross-Media Retrieval

Life-long Cross-media Correlation Learning

Cross-media Residual Correlation Learning

Multiple Kernel Visual-Auditory Representation Learning for Retrieval

CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network

Cross-media Retrieval by Intra-Media and Inter-Media Correlation Mining

Reinforced Cross-Media Correlation Learning by Context-Aware Bidirectional Translation

Cross-modality Correlation Propagation for Cross-Media Retrieval

Deep Cross-Media Knowledge Transfer

Understanding Visual-Auditory Correlation from Heterogeneous Features for Cross-Media Retrieval

Quintuple-Media Joint Correlation Learning with Deep Compression and Regularization

Cross-media Retrieval by Exploiting Fine-Grained Correlation at Entity Level

Learning Semantic Correlations for Cross-Media Retrieval

Cross-Media Retrieval Method Based on Content Correlations

Self-supervised Correlation Learning for Cross-Modal Retrieval

Cross-media retrieval by cluster-based correlation analysis

Mining Semantic Correlation of Heterogeneous Multimedia Data for Cross-Media Retrieval

Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks

Coupled Feature Mapping and Correlation Mining for Cross-Media Retrieval