Abstract: Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises because MEI is not shared across modalities, so aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting the similarity-based alleviation strategy for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model the noisy many-to-many relationships inherent in multi-modal input for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy with a 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
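
As background for the information-bottleneck motivation in the abstract, the generic objective can be written as below. This is the standard formulation, not the paper's own objective, which the abstract does not spell out. It seeks a representation \(Z\) of an input \(X\) that stays predictive of a target \(Y\) while compressing \(X\):

\[
\max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X), \qquad \beta > 0
\]

Here \(I(\cdot\,;\cdot)\) denotes mutual information and \(\beta\) is the trade-off weight. Under the assumption that noise in \(X\) carries little information about \(Y\), the compression term discourages \(Z\) from retaining it. How DisNCL adapts this idea to two modalities is not stated in the abstract.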

What problem does this paper attempt to address?

The paper addresses noisy correspondences in cross-modal retrieval (i.e., imperfectly aligned or incorrectly matched training pairs) by proposing a new method called DisNCL (Disentangled Noisy Correspondence Learning).

### Research Background and Problem Definition

- **Background**: Cross-modal retrieval must find the most relevant samples across different modalities, for example matching images to texts. Existing methods usually assume that the training data is well matched, but this assumption often fails in practice because real-world data inevitably contains noisy correspondences (e.g., mismatched image-text pairs).
- **Problem**: How can cross-modal retrieval performance be improved when the training data contains noisy correspondences? Specifically, how can a robust learning strategy be designed to handle them when the training data is heavily noisy?

### Solution Overview

- **Core Idea**: Disentangle features with an information-theoretic framework that separates modality-invariant information (MII) from modality-exclusive information (MEI), then predict similarity in the modality-invariant subspace to mitigate the impact of MEI on model performance.
- **Specific Methods**:
  - Propose a new objective function \(L_{\text{Dis}}\), based on the information bottleneck principle, to extract MII and MEI.
  - Perform similarity prediction in the disentangled modality-invariant subspace and use a soft matching objective to model many-to-many relationships, thereby improving the model's robustness to noise.
  - Use sample filtering and a robust hinge loss to identify noisy correspondences and suppress their negative impact.
  - Theoretically prove the advantages of DisNCL in noise robustness and feature disentanglement.

### Main Contributions

1. **DisNCL**: Introduces, for the first time, certifiably optimal cross-modal disentanglement efficacy to make the model more robust to noisy multi-modal training data.
2. **Similarity Prediction**: Improves the identification and suppression of noisy correspondences by predicting similarity accurately in the modality-invariant subspace.
3. **Soft Matching Objective**: Models many-to-many relationships in complex multi-modal data by estimating soft targets, further improving the model's effectiveness.

### Conclusion

This paper addresses noisy correspondences in cross-modal retrieval by proposing a new information-theoretic framework, DisNCL, which effectively disentangles modality-invariant information from modality-exclusive information. A series of experiments validates the effectiveness of the method and the correctness of the theoretical analysis.
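
The method steps summarized above (similarity prediction on MII features, soft matching targets, and a noise-aware hinge loss) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `alpha` mixing weight, the softmax temperature, and the margin-scaling rule are all assumptions chosen for illustration, and the MII features are taken as already disentangled inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def similarity_matrix(img_mii, txt_mii):
    """Pairwise similarities computed only on the (assumed) MII features."""
    return [[cosine(i, t) for t in txt_mii] for i in img_mii]


def soft_targets(sim, alpha=0.5, temperature=0.1):
    """Blend one-hot pairings with a row-wise softmax of the similarities.

    alpha is a hypothetical mixing weight: 1.0 gives hard one-to-one
    targets, lower values spread mass over other plausible matches.
    """
    targets = []
    for i, row in enumerate(sim):
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((s - peak) / temperature) for s in row]
        total = sum(exps)
        targets.append([
            alpha * (1.0 if j == i else 0.0) + (1.0 - alpha) * e / total
            for j, e in enumerate(exps)
        ])
    return targets


def robust_hinge_loss(sim, base_margin=0.2):
    """Hardest-negative hinge loss with a similarity-scaled margin.

    Pairs with low MII similarity (likely mismatched) get a smaller
    margin and so weigh less. This scaling rule is an assumption.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        margin = base_margin * max(sim[i][i], 0.0)
        hardest_txt = max(sim[i][j] for j in range(n) if j != i)
        hardest_img = max(sim[j][i] for j in range(n) if j != i)
        loss += max(0.0, margin + hardest_txt - sim[i][i])
        loss += max(0.0, margin + hardest_img - sim[i][i])
    return loss / n


# Toy batch: the third image-text pair is deliberately mismatched.
img_mii = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt_mii = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]]

sim = similarity_matrix(img_mii, txt_mii)
targets = soft_targets(sim)
loss = robust_hinge_loss(sim)
```

In this toy batch the mismatched third pair has negative similarity, so its margin collapses to zero. That mirrors the idea of suppressing suspected noisy correspondences, although the paper's actual filtering criterion may differ.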

Disentangled Noisy Correspondence Learning

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Noisy Correspondence Learning with Meta Similarity Correction

Learning with Noisy Correspondence

Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Learning From Noisy Correspondence With Tri-Partition for Cross-Modal Matching

Towards a Unified Framework of Contrastive Learning for Disentangled Representations

Disentanglement Translation Network for multimodal sentiment analysis

Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences

Agreement or Disagreement in Noise-tolerant Mutual Learning?

Disentangled Contrastive Collaborative Filtering

NAC: Mitigating Noisy Correspondence in Cross-Modal Matching Via Neighbor Auxiliary Corrector

Disentangled Contrastive Learning for Learning Robust Textual Representations

CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network

Learning Disentangled Label Representations for Multi-label Classification

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Symmetric Cross Entropy for Robust Learning with Noisy Labels

Unsupervised Conversation Disentanglement through Co-Training