Multi-granularity Correspondence Learning from Long-term Noisy Videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng
2024-01-30
Abstract: Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to over-high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to the clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton) that addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out the irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits the potential faulty negative samples in clip-caption contrast by rectifying the alignment target with OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses correspondence learning between long videos and their captions. Existing video-language research mainly focuses on short video clips, and long-term temporal dependencies have rarely been explored because modeling long videos is computationally expensive. A feasible alternative is to learn the correspondence between video clips and captions, but this runs into the multi-granularity noisy correspondence (MNC) problem. MNC covers coarse-grained misalignment (asynchronous or irrelevant clip-caption pairs) and fine-grained misalignment (between frames and words). The paper proposes Norton (NOise Robust Temporal Optimal traNsport), which handles both within a unified optimal transport (OT) framework. Norton captures long-term dependencies through OT-based video-paragraph and clip-caption contrastive losses. For coarse-grained misalignment, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. For fine-grained misalignment, Norton uses a soft-maximum operator to identify key frames and crucial words. Norton also handles potentially faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, which supports precise temporal modeling. Experiments on video retrieval, video question answering, and action segmentation verify the effectiveness of the method. By working at the level of clip-caption correspondence rather than modeling entire long videos directly, the approach aims to sidestep the high computational cost that has limited long-video understanding.
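To make the three ingredients named above more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the soft-maximum operator is a log-sum-exp over frame-word similarities, that the OT plan is computed by Sinkhorn iterations with uniform marginals, and that the alignable prompt bucket is an extra row and column of constant similarity appended before transport. The function names, the parameters `alpha`, `eps`, and `bucket`, and the toy similarity values are all hypothetical.

```python
import numpy as np


def soft_max_similarity(frames, words, alpha=10.0):
    """Smooth (log-sum-exp) maximum of frame-word similarities.

    frames: (F, d) and words: (W, d), both assumed L2-normalised.
    For each frame, take a soft maximum over words, then average over frames.
    """
    sim = frames @ words.T                     # (F, W) cosine similarities
    peak = sim.max(axis=1, keepdims=True)      # subtract the row max for stability
    lse = peak[:, 0] + np.log(np.exp(alpha * (sim - peak)).sum(axis=1)) / alpha
    return float(lse.mean())


def sinkhorn(sim, eps=0.1, n_iters=50):
    """Entropic OT plan for a similarity matrix, with uniform marginals."""
    n, m = sim.shape
    kernel = np.exp(sim / eps)                 # higher similarity -> cheaper transport
    a = np.full(n, 1.0 / n)                    # row marginal (clips)
    b = np.full(m, 1.0 / m)                    # column marginal (captions)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)                   # rescale rows
        v = b / (kernel.T @ u)                 # rescale columns
    return u[:, None] * kernel * v[None, :]


def align_with_bucket(sim, bucket=0.3, eps=0.1, n_iters=50):
    """Append a constant-similarity bucket row/column, transport, then drop it.

    Clips or captions whose best match is weaker than `bucket` send most of
    their mass to the bucket, which is how irrelevant items get filtered here.
    """
    n, m = sim.shape
    padded = np.full((n + 1, m + 1), bucket)
    padded[:n, :m] = sim
    return sinkhorn(padded, eps, n_iters)[:n, :m]


# Toy clip-caption similarities (hypothetical values): clip 0 matches
# caption 1 and clip 1 matches caption 0 (asynchronous pair), while
# clip 2 and caption 2 match nothing well (irrelevant).
toy_sim = np.array([[0.1, 0.9, 0.0],
                    [0.8, 0.1, 0.0],
                    [0.0, 0.1, 0.05]])
plan = align_with_bucket(toy_sim)
print(np.round(plan, 3))
```

In this toy case the returned plan assigns clip 0 mainly to caption 1 and clip 1 mainly to caption 0, and clip 2 retains less total mass than clip 0 because part of it is absorbed by the bucket. A plan of this kind could then serve as a soft alignment target in place of the diagonal one-hot target of a standard contrastive loss, which is one way to read "rectifying the alignment target with OT assignment".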