Integrating information theory and adversarial learning for cross-modal retrieval

Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew
DOI: https://doi.org/10.1016/j.patcog.2021.107983
IF: 8
2021-09-01
Pattern Recognition
Abstract: Accurately matching visual and textual data in cross-modal retrieval has been widely studied in the multimedia community. To address these challenges posited by the heterogeneity gap and the semantic gap, we propose integrating Shannon information theory and adversarial learning. In terms of the heterogeneity gap, we integrate modality classification and information entropy maximization adversarially. For this purpose, a modality classifier (as a discriminator) is built to distinguish the text and image modalities according to their different statistical properties. This discriminator uses its output probabilities to compute Shannon information entropy, which measures the uncertainty of the modality classification it performs. Moreover, feature encoders (as a generator) project uni-modal features into a commonly shared space and attempt to fool the discriminator by maximizing its output information entropy. Thus, maximizing information entropy gradually reduces the distribution discrepancy of cross-modal features, thereby achieving a domain confusion state where the discriminator cannot classify two modalities confidently. To reduce the semantic gap, Kullback-Leibler (KL) divergence and bi-directional triplet loss are used to associate the intra- and inter-modality similarity between features in the shared space. Furthermore, a regularization term based on KL-divergence with temperature scaling is used to calibrate the biased label classifier caused by the data imbalance issue. Extensive experiments with four deep models on four benchmarks are conducted to demonstrate the effectiveness of the proposed approach.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
What problem does this paper attempt to address?
The paper targets two key challenges in cross-modal retrieval: the heterogeneity gap and the semantic gap.

1. **Heterogeneity Gap**: Data from different modalities (such as images and text) have different statistical properties, so their feature distributions differ and even semantically similar features cannot be compared directly. This hinders effective matching of cross-modal data.
2. **Semantic Gap**: The low-level representations that computers work with (such as pixels or symbols) differ from users' high-level perception, which makes it difficult for a system to capture the high-level semantics of the data.

To address these challenges, the authors propose a method that combines Shannon information theory and adversarial learning. Specifically:

- **Reducing the Heterogeneity Gap**: A modality classifier (the discriminator) tries to distinguish image features from text features, and the Shannon entropy of its output probabilities measures how uncertain that classification is. The feature encoders (the generator) project uni-modal features into a shared space and try to maximize this entropy, producing modality-invariant features that the classifier cannot confidently assign to either modality. This gradually reduces the discrepancy between the two feature distributions.
- **Reducing the Semantic Gap**: A Kullback-Leibler (KL) divergence loss and a bi-directional triplet loss preserve intra- and inter-modality semantic similarity between features in the shared space. In addition, a regularization term based on KL divergence with temperature scaling calibrates the label classifier, which is otherwise biased by data imbalance.

With these components, the paper aims to build a more effective cross-modal retrieval system; the authors evaluate the approach with four deep models on four benchmarks.
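The quantities named in this summary can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the margin of 0.2, and the temperature values are all assumptions chosen for illustration, and the paper's exact loss formulations may differ. The sketch only shows what the entropy of a modality classifier's output, a temperature-scaled KL divergence, and a bi-directional triplet loss compute.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bidirectional_triplet_loss(img, txt, neg_img, neg_txt, margin=0.2):
    """Hinge triplet loss in both directions (image->text and text->image).

    Assumption: similarity is cosine and the margin is 0.2; both are
    illustrative choices, not values taken from the paper.
    """
    pos = cosine(img, txt)
    image_to_text = max(0.0, margin - pos + cosine(img, neg_txt))
    text_to_image = max(0.0, margin - pos + cosine(txt, neg_img))
    return image_to_text + text_to_image

# Hypothetical modality-classifier outputs over [image, text].
confident = softmax([4.0, -4.0])  # discriminator is sure the input is an image
confused = [0.5, 0.5]             # domain-confusion state
print(entropy(confident) < entropy(confused))  # True: entropy peaks at 50/50

# Temperature scaling softens a label distribution before comparing with KL.
sharp = softmax([2.0, 0.5, -1.0], temperature=1.0)
soft = softmax([2.0, 0.5, -1.0], temperature=4.0)
print(kl_divergence(sharp, soft))
```

In this reading, the generator's objective pushes the classifier output toward the 50/50 case, where the entropy reaches its maximum of ln 2 for two modalities, while the triplet term drops to zero once matched image-text pairs are more similar than mismatched ones by at least the margin.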