Abstract: Video moment localization aims to retrieve the target segment of an untrimmed video according to a natural language query. Weakly supervised methods have gained attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges for weakly supervised methods lies in the mismatch between video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities, driven by reconstructing masked queries, between positive and negative video proposals. However, the reconstruction may be influenced by a latent spurious correlation between the unmasked and the masked parts of the query. This correlation distorts the restoring process and further degrades the efficacy of contrastive learning, since the masked words are not reconstructed entirely from cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality and query knowledge. Then, by introducing counterfactual cross-modality knowledge into this aggregation, the spurious impact of the unmasked part on the reconstruction is explicitly modeled. Finally, by suppressing the unimodal effect of the masked query, we rectify the reconstructions of video proposals to perform reasonable contrastive learning. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at https://github.com/sLdZ0306/CCR.
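
The three steps named in the abstract (aggregate, build a counterfactual, suppress the unimodal effect) can be illustrated with a toy sketch. This is not the authors' implementation: the additive log-space fusion, the uniform counterfactual input, the margin loss, and every function name below are assumptions chosen only to make the idea concrete.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def fuse(cross_modal_logits, query_logits):
    """Hypothetical aggregation: add the log-probabilities of both branches."""
    a = log_softmax(cross_modal_logits)
    b = log_softmax(query_logits)
    return [x + y for x, y in zip(a, b)]


def debiased_scores(cross_modal_logits, query_logits):
    """Factual scores minus counterfactual scores.

    The counterfactual replaces the cross-modal branch with an
    uninformative (uniform) input, so it captures what the unmasked
    words alone contribute to predicting the masked word.
    """
    factual = fuse(cross_modal_logits, query_logits)
    uniform = [0.0] * len(query_logits)  # assumed counterfactual input
    counterfactual = fuse(uniform, query_logits)
    return [f - c for f, c in zip(factual, counterfactual)]


def reconstruction_loss(scores, target_index):
    """Cross-entropy of the masked word under the given scores."""
    return -log_softmax(scores)[target_index]


def margin_contrast(loss_pos, loss_neg, margin=0.5):
    """Hypothetical hinge loss: the positive proposal should reconstruct
    the masked word better than the negative one by at least `margin`."""
    return max(0.0, loss_pos - loss_neg + margin)


# Toy vocabulary of 4 words; the masked word has index 2.
query_prior = [0.0, 0.0, 3.0, 0.0]   # unmasked words already hint at word 2
positive = [0.1, 0.2, 2.0, 0.1]      # proposal that shows the described event
negative = [0.5, 0.4, 0.3, 0.6]      # proposal that does not

loss_pos = reconstruction_loss(debiased_scores(positive, query_prior), 2)
loss_neg = reconstruction_loss(debiased_scores(negative, query_prior), 2)
contrast = margin_contrast(loss_pos, loss_neg)
```

With this purely additive fusion the query-only term cancels exactly, so the debiased scores depend only on the cross-modal branch. Under a non-linear fusion the subtraction would only reduce, not eliminate, the language prior. In the toy inputs, the strong query prior makes the factual reconstruction losses of the positive and negative proposals nearly equal, while the debiased losses separate them clearly.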

Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval

Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism

Siamese Alignment Network for Weakly Supervised Video Moment Retrieval

Multi-scale 2D Representation Learning for Weakly-Supervised Moment Retrieval

Weakly Supervised Moment Localization with Natural Language Based on Semantic Reconstruction

Weakly-Supervised Video Moment Retrieval Via Semantic Completion Network

Moment is Important: Language-Based Video Moment Retrieval Via Adversarial Learning

Are Binary Annotations Sufficient? Video Moment Retrieval Via Hierarchical Uncertainty-Based Active Learning

Weakly Supervised Video Moment Retrieval From Text Queries

Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval

Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

Video Corpus Moment Retrieval Via Deformable Multigranularity Feature Fusion and Adversarial Training

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization

Triadic Temporal-Semantic Alignment for Weakly-Supervised Video Moment Retrieval

Video Moment Retrieval with Noisy Labels

STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction

Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval