Integrating information theory and adversarial learning for cross-modal retrieval

Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew

DOI: https://doi.org/10.1016/j.patcog.2021.107983

IF: 8

2021-09-01

Pattern Recognition

Abstract: Accurately matching visual and textual data in cross-modal retrieval has been widely studied in the multimedia community. To address the challenges posed by the heterogeneity gap and the semantic gap, we propose integrating Shannon information theory and adversarial learning. In terms of the heterogeneity gap, we integrate modality classification and information entropy maximization adversarially. For this purpose, a modality classifier (as a discriminator) is built to distinguish the text and image modalities according to their different statistical properties. This discriminator uses its output probabilities to compute Shannon information entropy, which measures the uncertainty of the modality classification it performs. Moreover, feature encoders (as a generator) project uni-modal features into a commonly shared space and attempt to fool the discriminator by maximizing its output information entropy. Thus, maximizing information entropy gradually reduces the distribution discrepancy of cross-modal features, thereby achieving a domain confusion state where the discriminator cannot classify two modalities confidently. To reduce the semantic gap, Kullback-Leibler (KL) divergence and bi-directional triplet loss are used to associate the intra- and inter-modality similarity between features in the shared space. Furthermore, a regularization term based on KL-divergence with temperature scaling is used to calibrate the biased label classifier caused by the data imbalance issue. Extensive experiments with four deep models on four benchmarks are conducted to demonstrate the effectiveness of the proposed approach.
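The entropy-based adversarial objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`softmax`, `shannon_entropy`, `mean_discriminator_entropy`), the two-logit discriminator output, and the toy logits are all assumptions made for illustration, using only the Python standard library.

```python
import math

def softmax(logits):
    """Turn raw discriminator logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(p * math.log(p + eps) for p in probs)

def mean_discriminator_entropy(batch_logits):
    """Average entropy of the modality classifier's outputs over a batch."""
    entropies = [shannon_entropy(softmax(logits)) for logits in batch_logits]
    return sum(entropies) / len(entropies)

# Hypothetical logits over the two modalities (image, text).
confident_logits = [[4.0, -4.0], [-3.5, 3.5]]  # discriminator separates modalities
confused_logits = [[0.1, 0.0], [0.0, 0.05]]    # discriminator is nearly undecided

h_confident = mean_discriminator_entropy(confident_logits)
h_confused = mean_discriminator_entropy(confused_logits)

# Assumed form of the encoder objective: minimizing the negative entropy
# is the same as maximizing the discriminator's output entropy.
encoder_loss = -h_confused
```

For a two-way classifier the entropy is bounded by log 2 (about 0.693 nats), which is reached when the discriminator assigns probability 0.5 to each modality. That is the "domain confusion" state the abstract refers to.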

Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic

What problem does this paper attempt to address?

This paper addresses two key challenges in cross-modal retrieval: the heterogeneity gap and the semantic gap.

1. **Heterogeneity Gap**: Data from different modalities (such as images and text) have different statistical properties, so their distributions in feature space differ and even semantically similar features cannot be compared directly. This hinders effective matching of cross-modal data.
2. **Semantic Gap**: The low-level representations that computers operate on (such as pixels or symbols) differ from the high-level perception of users, which makes it difficult for a computer to capture the high-level semantics of the data.

To address these challenges, the authors propose a method that combines information theory and adversarial learning. Specifically:

- **Reducing the Heterogeneity Gap**: A modality classifier is combined with information entropy maximization. The modality classifier (the discriminator) tries to distinguish image features from text features, while the feature encoders (the generator) project both modalities into a shared space and try to produce modality-invariant features that maximize the entropy of the classifier's output. As training proceeds, the feature distributions of the two modalities move closer together until the classifier can no longer assign a modality confidently.
- **Reducing the Semantic Gap**: A Kullback-Leibler (KL) divergence loss and a bi-directional triplet loss preserve semantic similarity between features in the shared space, both within and across modalities. In addition, a regularization term based on KL divergence with temperature scaling calibrates the label classifier, which is otherwise biased by data imbalance.

Through these methods, the paper aims to build a more effective cross-modal retrieval system.
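The two semantic-gap ingredients named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: cosine similarity, the margin of 0.2, the temperature values, and all function names and toy vectors are hypothetical choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def triplet(anchor, positive, negative, margin):
    """Hinge loss: the positive must beat the negative by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def bidirectional_triplet_loss(img, txt, neg_txt, neg_img, margin=0.2):
    """Sum of the image-to-text and text-to-image triplet terms."""
    image_to_text = triplet(img, txt, neg_txt, margin)
    text_to_image = triplet(txt, img, neg_img, margin)
    return image_to_text + text_to_image

def softmax_t(logits, temperature=1.0):
    """Softmax with temperature scaling; a larger temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy shared-space embeddings (hypothetical values).
img, txt = [1.0, 0.0], [0.9, 0.1]
easy_loss = bidirectional_triplet_loss(img, txt, neg_txt=[0.0, 1.0], neg_img=[0.0, 1.0])
hard_loss = bidirectional_triplet_loss(img, txt, neg_txt=[1.0, 0.05], neg_img=[0.0, 1.0])

# Toy label-classifier logits for a matched image-text pair (hypothetical values).
img_logits, txt_logits = [2.0, 0.0, -1.0], [1.0, 0.5, -0.5]
kl_sharp = kl_divergence(softmax_t(img_logits, 1.0), softmax_t(txt_logits, 1.0))
kl_soft = kl_divergence(softmax_t(img_logits, 4.0), softmax_t(txt_logits, 4.0))
```

In this sketch the triplet term is zero once the matched pair is already separated from the negatives by the margin, and becomes positive when a negative sits close to the anchor. Raising the temperature softens both label distributions before they are compared, which is one plausible way a temperature-scaled KL term could temper an over-confident classifier.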

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Semantic Consistency Hashing for Cross-Modal Retrieval

Integrating Multi-Label Contrastive Learning With Dual Adversarial Graph Neural Networks for Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval

Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Dual discriminant adversarial cross-modal retrieval

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Domain Uncertainty Based on Information Theory for Cross-Modal Hash Retrieval

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Category Alignment Adversarial Learning for Cross-modal Retrieval

Adversarial Learning For Cross-Modal Retrieval With Wasserstein Distance

Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval

Clustering-driven Deep Adversarial Hashing for scalable unsupervised cross-modal retrieval

Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval

Discriminative Dictionary Learning with Common Label Alignment for Cross-Modal Retrieval

Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching