Abstract: Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) *alignment* of the features of similar samples, and (2) *uniformity* of the induced distribution of the normalized features on the hypersphere. Video grounding raises two issues that conflict with these properties: (1) some visual entities co-exist in both the ground truth and other moments, i.e., semantic overlapping; and (2) only a few moments in the video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations, which makes it unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment among similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
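
The two properties named in the abstract can be made concrete with a small sketch. It assumes one common formulation: alignment as the mean squared distance between positive pairs, and uniformity as the log of the mean Gaussian potential over all pairs of normalized features. The abstract does not state which formulation G2L uses, so the function names and the defaults `alpha=2` and `t=2.0` are illustrative assumptions.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(pairs, alpha=2):
    """Mean distance^alpha over positive pairs (lower = better aligned)."""
    return sum(sq_dist(a, b) ** (alpha / 2) for a, b in pairs) / len(pairs)


def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential over all pairs (lower = more uniform)."""
    vals = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(points, 2)]
    return math.log(sum(vals) / len(vals))


# Hypothetical features: four collapsed points vs. four spread around the circle.
collapsed = [normalize([1.0, 0.0]) for _ in range(4)]
spread = [normalize(p) for p in ([1, 0], [0, 1], [-1, 0], [0, -1])]
print(uniformity(collapsed), uniformity(spread))
```

Under this formulation, features that collapse to one point score worse on uniformity than features spread over the sphere. This is the failure mode the abstract attributes to pushing away every unannotated moment.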

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two main problems in the video grounding task:

1. **Semantic Overlap Problem**: In video grounding, some visual entities may appear both in the labeled moment and in other, unlabeled moments, which leads to contradictory feature representations. For example, in a video clip, entities such as "person", "blade", and "belt" appear both in the target moment \(m_4\) and in another moment \(m_1\). Since the video grounding task has no classification labels, existing methods can only separate positive and negative samples based on the labeled moment. This ignores the semantic overlap between different video moments and leads to inconsistent feature representations.
2. **Sparse Labeling Dilemma**: Because labeling is costly, usually only a few moments are labeled even though a video contains thousands of frames. This severe data imbalance introduces a significant learning bias into naive contrastive learning: unlabeled moments are pushed away by different queries regardless of whether they are semantically related. This undermines the uniformity requirement of contrastive learning.

### Solutions

To address the above problems, the authors propose a new framework, **Geodesic and Game Localization (G2L)**, which learns semantically aligned and uniformly distributed video feature representations through geodesic distance and game theory. Specifically:

- **Geodesic-Distance-Guided Contrastive Learning**: The correlation between video moments is measured by geodesic distance rather than by temporal position. Geodesic distance better reflects semantic relevance, which relaxes the strict localization principle.
- **Semantic Shapley Interaction**: The fine-grained semantic alignment between video moments and queries is quantified by the Shapley value from game theory. This helps the model distinguish similar video moments more accurately and avoid confusing them.

### Main Contributions

1. **Proposing the G2L Framework**: It combines geodesic distance and game theory to learn semantically aligned and uniformly distributed representations of videos and queries.
2. **Geodesic-Distance-Guided Contrastive Learning**: A new contrastive learning scheme that takes into account the correct semantics of all moments in the video.
3. **Effective Semantic Shapley Interaction Strategy**: Similar video moments are sampled based on geodesic distance, so that the model focuses on their subtle differences.
4. **Extensive Experimental Verification**: Experimental results on three public datasets demonstrate the effectiveness of G2L.

### Experimental Results

- **ActivityNet-Captions**: G2L achieves absolute performance improvements of 7.8% and 5.7% over the latest contrastive-learning-based methods, IVG-DCL and SSCS, respectively.
- **Charades-STA**: G2L achieves a 1.1% improvement over the latest method, MMN, on the more stringent evaluation metric "R@1 IoU = 0.7".
- **TACoS**: G2L achieves state-of-the-art results in most settings, although the improvement is relatively small because the sparse-labeling dilemma and the semantic-overlap problem are less pronounced in the TACoS dataset.

In conclusion, by introducing geodesic distance and game theory, G2L addresses the semantic-overlap and sparse-labeling problems in the video grounding task and significantly improves model performance.
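
To illustrate how a geodesic distance between moments can differ from a direct distance, here is a minimal sketch that assumes an Isomap-style construction: each moment feature is linked to its `k` nearest neighbors, and the geodesic distance is the shortest path through that graph. The summary does not describe how G2L builds its graph, so `knn_graph`, `geodesic_from`, and the toy features are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(features, k):
    """Connect every moment to its k nearest neighbors (undirected, weighted)."""
    n = len(features)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (euclidean(features[i], features[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph symmetric
    return graph


def geodesic_from(graph, source):
    """Dijkstra shortest-path distances from `source` to every node."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy moment features lying on a half circle (a curved manifold).
moments = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
           for a in (0, 45, 90, 135, 180)]
geo = geodesic_from(knn_graph(moments, k=2), source=0)
print(geo[4], euclidean(moments[0], moments[4]))
```

In this toy setup the path along the curve between the two end moments is longer than the straight line between them. A geodesic measure can therefore rank moments differently from a direct distance.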

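The Shapley-based idea in the summary can be illustrated with the standard pairwise Shapley interaction index from cooperative game theory, computed exactly on a toy game. The players (`clip_a`, `clip_b`, `word`) and the value function `toy_value` are hypothetical; the summary does not specify how G2L defines its players or its game, so this is a sketch of the general concept rather than the paper's formulation.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Exact pairwise Shapley interaction index between players i and j.

    Averages value(S+{i,j}) - value(S+{i}) - value(S+{j}) + value(S)
    over all coalitions S that contain neither i nor j.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    return total


def toy_value(coalition):
    """Hypothetical matching score: additive parts plus one synergy bonus."""
    weights = {"clip_a": 0.2, "clip_b": 0.1, "word": 0.3}
    score = sum(weights[p] for p in coalition)
    if {"clip_a", "word"} <= coalition:
        score += 0.5  # clip_a and the word only help fully when both are present
    return score


players = ["clip_a", "clip_b", "word"]
print(shapley_interaction(players, toy_value, "clip_a", "word"))
print(shapley_interaction(players, toy_value, "clip_a", "clip_b"))
```

A positive index means two players contribute more together than separately. In the toy game only the `clip_a`/`word` pair has such synergy, which is the kind of fine-grained signal that could separate otherwise similar moments.
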
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
