Disentangled Noisy Correspondence Learning

Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang
2024-08-10
Abstract: Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises because MEI is not shared across modalities, so aligning it during training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiably optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategies for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model the noisy many-to-many relationships inherent in multi-modal input for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy with a 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses noisy correspondences in cross-modal retrieval, that is, imperfectly aligned or incorrectly matched pairs in the training data, by proposing a new method called DisNCL (Disentangled Noisy Correspondence Learning).

### Research Background and Problem Definition

- **Background**: Cross-modal retrieval requires finding the most relevant samples across different modalities, for example matching images with texts. Existing methods usually assume that the training data is well matched, but in practice this assumption often fails because real-world data inevitably contains noisy correspondences (e.g., mismatched image-text pairs).
- **Problem**: How can cross-modal retrieval performance be improved in the presence of noisy correspondences? Specifically, how can robust learning strategies be designed when the training data contains a large amount of such noise?

### Solution Overview

- **Core Idea**: Disentangle features with an information-theoretic framework that separates modality-invariant information (MII) from modality-exclusive information (MEI), then predict similarity in the modality-invariant subspace so that MEI no longer misleads the prediction.
- **Specific Methods**:
  - Propose a new objective function \(L_{\text{Dis}}\), based on the information bottleneck principle, to extract MII and MEI.
  - Perform similarity prediction in the disentangled modality-invariant subspace and use a soft matching objective to model many-to-many relationships, improving the model's robustness to noise.
  - Use sample filtering and a robust hinge loss to identify noisy correspondences and suppress their negative impact.
  - Theoretically prove the superiority of DisNCL in terms of noise robustness and feature disentanglement.

### Main Contributions

1. **DisNCL**: Introduces, for the first time, a framework with certifiably optimal cross-modal disentanglement efficacy, enhancing the model's robustness to noisy multi-modal training data.
2. **Similarity Prediction**: Improves the identification and suppression of noisy correspondences by predicting similarity accurately in the modality-invariant subspace.
3. **Soft Matching Objective**: Models many-to-many relationships in complex multi-modal data by estimating soft targets, further improving the model's effectiveness.

### Conclusion

The paper tackles noisy correspondences in cross-modal retrieval with a new information-theoretic framework, DisNCL, which effectively disentangles modality-invariant information from modality-exclusive information. Experiments validate both the method's effectiveness and the theoretical analysis.
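
The summary mentions sample filtering combined with a robust hinge loss but does not give the formulas. The following is a minimal sketch of that general idea, not DisNCL's actual loss: pairs whose predicted similarity falls below a threshold are filtered out, and the remaining pairs contribute a hinge loss whose margin is scaled by a per-pair confidence. The function names, the `threshold` and `margin` values, and the choice of clipped cosine similarity as the confidence are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(img, txt):
    """Return sim[i, j] = cosine similarity between image i and text j."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T


def pair_confidence(sim, threshold=0.5):
    """Hypothetical filter: confidence of each annotated pair (the diagonal).

    Similarities are clipped to [0, 1]; pairs below `threshold` are treated
    as noisy and receive zero confidence.
    """
    conf = np.clip(np.diag(sim), 0.0, 1.0)
    return np.where(conf >= threshold, conf, 0.0)


def soft_weighted_hinge_loss(sim, weights, margin=0.2):
    """Bidirectional hardest-negative hinge loss with a confidence-scaled margin."""
    n = sim.shape[0]
    pos = np.diag(sim)
    # Exclude the annotated (diagonal) pairs when searching for negatives.
    neg = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    hardest_text = neg.max(axis=1)   # hardest negative text for each image
    hardest_image = neg.max(axis=0)  # hardest negative image for each text
    soft_margin = margin * weights
    loss_i2t = np.maximum(0.0, soft_margin + hardest_text - pos)
    loss_t2i = np.maximum(0.0, soft_margin + hardest_image - pos)
    return float(np.mean(weights * (loss_i2t + loss_t2i)))


# Toy batch: the third image-text pair is deliberately mismatched.
images = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
texts = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])

sim = cosine_similarity_matrix(images, texts)
weights = pair_confidence(sim)
filtered_loss = soft_weighted_hinge_loss(sim, weights)
unfiltered_loss = soft_weighted_hinge_loss(sim, np.ones(len(sim)))
print(weights, filtered_loss, unfiltered_loss)
```

In this toy batch the mismatched pair receives zero confidence, so it stops contributing to the loss, while treating every pair as clean lets it dominate the loss. In the paper's setting the similarities would come from the disentangled modality-invariant subspace rather than from raw features.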