A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels

Haochen Han, Minnan Luo, Huan Liu, Fang Nan
2024-03-20
Abstract: Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces the multimodal samples to align incorrect semantics and widens the heterogeneous gap, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport cost. Second, to narrow the discrepancy in multi-modal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both components leverage the inherent correlation among multi-modal data to facilitate an effective cost function. Experiments on three widely used cross-modal retrieval datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
Computer Vision and Pattern Recognition, Information Retrieval, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses supervised cross-modal retrieval (CMR) with noisy labels. Specifically, it focuses on the following aspects:

1. **Impact of Noisy Labels**:
   - Noisy labels can cause unrelated samples to be mistakenly treated as similar in the shared space.
   - Noisy labels widen the heterogeneous gap between modalities, which harms cross-modal retrieval performance.
2. **Proposed Method**:
   - A unified framework, UOT-RCL (a Unified framework based on Optimal Transport for Robust Cross-modal Retrieval), uses optimal transport to correct noisy labels and reduce the discrepancy between multimodal data.
   - Noisy labels are corrected progressively through partial optimal transport (partial OT), using a novel cross-modal consistent cost function that blends information from different modalities.
   - An OT-based relation alignment infers semantic-level matching between modalities, further narrowing the heterogeneous gap.

### Main Contributions

- A new framework based on optimal transport is proposed for supervised cross-modal retrieval with noisy labels.
- A semantic alignment method based on partial optimal transport is designed to progressively correct noisy labels.
- A relation alignment method based on optimal transport is proposed to infer semantic-level cross-modal matching and narrow the gap between modalities.
- Extensive experiments on three widely used cross-modal retrieval datasets demonstrate the effectiveness and robustness of the proposed method against noisy labels.
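
To make the label-correction idea more concrete, the following is a minimal sketch of how a transport plan between samples and classes can be turned into corrected labels. It is not the paper's algorithm. It uses plain entropic OT solved with Sinkhorn iterations rather than partial OT, and the cost (negative log of the averaged image and text predictions) is only an assumed stand-in for the cross-modal consistent cost function. The names `sinkhorn`, `correct_labels`, `probs_img`, and `probs_txt` are hypothetical, as is the assumption of uniform class marginals.

```python
import numpy as np


def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan computed with Sinkhorn iterations."""
    kernel = np.exp(-cost / reg)  # Gibbs kernel of the cost matrix
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)    # rescale to match row marginals
        v = col_marginal / (kernel.T @ u)  # rescale to match column marginals
    return u[:, None] * kernel * v[None, :]


def correct_labels(probs_img, probs_txt, reg=0.1):
    """Illustrative label correction by transporting samples to classes.

    probs_img, probs_txt: (n_samples, n_classes) class probabilities
    predicted from each modality (assumed inputs, not from the paper).
    """
    n_samples, n_classes = probs_img.shape
    # Assumed cost: low when both modalities agree on a class.
    fused = 0.5 * (probs_img + probs_txt)
    cost = -np.log(fused + 1e-8)
    rows = np.full(n_samples, 1.0 / n_samples)
    cols = np.full(n_classes, 1.0 / n_classes)  # assumes balanced classes
    plan = sinkhorn(cost, rows, cols, reg=reg)
    soft = plan / plan.sum(axis=1, keepdims=True)  # per-sample soft labels
    return soft, soft.argmax(axis=1)


if __name__ == "__main__":
    preds = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
    soft_labels, hard_labels = correct_labels(preds, preds)
    print(hard_labels)
```

In this sketch every sample is transported in full, so every label is re-estimated. A partial OT formulation, as described in the summary, would instead transport only part of the total mass, which allows labels to be corrected progressively.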