Abstract: Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, crucial for specific moment detection. We propose RGNet which deeply integrates clip retrieval and grounding into a single network capable of processing long videos into multiple granular levels, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to mimic the long video paradigm closely during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
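
To make the two-stage setup concrete, here is a minimal sketch of a disjoint retrieve-then-ground pipeline, the kind of baseline the abstract says RGNet replaces. It is not RGNet itself. Every name in it (`split_into_clips`, `retrieve_clips`, `ground_in_clip`), the mean-pooled clip feature, the cosine scoring, and the toy 2-D features are assumptions made for illustration only.

```python
import math
from typing import List, Optional, Sequence, Tuple

Clip = Tuple[int, int]  # [start, end) frame indices


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_into_clips(num_frames: int, clip_len: int) -> List[Clip]:
    """Cut a video into consecutive, non-overlapping [start, end) clips."""
    return [(s, min(s + clip_len, num_frames)) for s in range(0, num_frames, clip_len)]


def retrieve_clips(frame_feats, query, clips: List[Clip], top_k: int) -> List[Clip]:
    """Stage 1 (hypothetical): rank clips by similarity of their mean-pooled feature."""
    scored = []
    for start, end in clips:
        pooled = [
            sum(f[d] for f in frame_feats[start:end]) / (end - start)
            for d in range(len(query))
        ]
        scored.append((cosine(pooled, query), (start, end)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [clip for _, clip in scored[:top_k]]


def ground_in_clip(frame_feats, query, clip: Clip, threshold: float) -> Optional[Clip]:
    """Stage 2 (hypothetical): longest run of frames in the clip scoring >= threshold."""
    start, end = clip
    best = None
    run_start = None
    for i in range(start, end + 1):  # the extra step at i == end closes an open run
        hit = i < end and cosine(frame_feats[i], query) >= threshold
        if hit and run_start is None:
            run_start = i
        elif not hit and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


# Toy data: only frames 12..14 of a 20-frame video match the query.
query = [1.0, 0.0]
frames = [[0.0, 1.0] for _ in range(20)]
for i in (12, 13, 14):
    frames[i] = [1.0, 0.1]

clips = split_into_clips(num_frames=20, clip_len=5)
top = retrieve_clips(frames, query, clips, top_k=1)
moment = ground_in_clip(frames, query, top[0], threshold=0.9)
print(top, moment)
```

In this sketch the retrieval stage sees only a pooled clip feature, so it never looks at frame-level evidence. That is one way to picture the limited fine-grained event understanding that the abstract attributes to disjoint pipelines.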

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the challenge of locating specific moments in long videos (20-120 minutes). Methods designed for grounding in short videos (5-30 seconds) perform poorly on long videos, where the task is akin to "finding a needle in a haystack." Because most real-life videos, such as those on YouTube and in AR/VR, are long, solving this problem is crucial.

### Specific Problems

1. **Limitations of Existing Methods**:
   - Existing methods typically operate in two stages: clip retrieval and grounding. This disjoint process limits the retrieval module's understanding of fine-grained events, which hurts specific moment detection.
   - Existing video retrieval models are mainly designed for high-level video topic retrieval, whereas specific moment detection requires an understanding of fine-grained events.
2. **Technical Challenges**:
   - Specific moments in long videos are very brief, making them difficult to find with simple clip retrieval and grounding methods.
   - Video information must be handled at multiple granularity levels, such as clips and frames.

### Solution

To address these issues, the authors propose RGNet (Retrieval and Grounding Network), a unified network for clip retrieval and grounding. Its main contributions are:

1. **Unified Network Architecture**:
   - RGNet deeply integrates clip retrieval and grounding into a single network, enabling end-to-end training. Shared features and joint optimization improve the retrieval module's understanding of fine-grained events and the overall performance.
2. **RG-Encoder**:
   - A novel transformer encoder, RG-Encoder, models video information at different granularity levels through a sparse attention mechanism and an attention loss.
   - The encoder operates at the clip and frame levels simultaneously, enhancing the ability to locate specific moments in long videos.
3. **Contrastive Clip Sampling Technique**:
   - A contrastive clip sampling technique mimics the long video paradigm during training, improving the network's training effectiveness on large-scale negative samples.

### Experimental Results

- RGNet achieved state-of-the-art performance on two long video temporal grounding datasets (Ego4D and MAD).
- Compared to existing methods, RGNet improved by 9.7% and 18.1% on the R1.3 and R5.3 metrics, respectively.
- The unified network architecture significantly improved both clip retrieval and grounding performance, especially in long videos.

### Conclusion

RGNet addresses the challenge of locating specific moments in long videos by unifying clip retrieval and grounding in a single network, which significantly improves performance on both tasks.
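
The reported metrics can be illustrated with a short sketch. It assumes that "R1.3" and "R5.3" denote Recall@1 and Recall@5 at a temporal IoU threshold of 0.3, which is a common convention for grounding benchmarks but is not spelled out in the summary above. The function names and the toy predictions are hypothetical.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Span]],
    gts: Sequence[Span],
    k: int,
    iou_threshold: float,
) -> float:
    """Fraction of queries whose top-k predictions contain a span with IoU >= threshold."""
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(gts) if gts else 0.0


# Two toy queries, each with predictions ranked best-first.
preds = [
    [(10.0, 20.0), (0.0, 5.0)],    # top-1 overlaps the ground truth
    [(50.0, 60.0), (31.0, 41.0)],  # only the second prediction overlaps
]
gts = [(12.0, 22.0), (30.0, 40.0)]

print(recall_at_k(preds, gts, k=1, iou_threshold=0.3))  # 0.5
print(recall_at_k(preds, gts, k=5, iou_threshold=0.3))  # 1.0
```

Under this reading, a higher k forgives ranking mistakes, while a higher IoU threshold demands tighter moment boundaries.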

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

Scanning Only Once: an End-to-end Framework for Fast Temporal Grounding in Long Videos

Dense Events Grounding in Video

End-to-End Dense Video Grounding via Parallel Regression

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

SnAG: Scalable and Accurate Video Grounding

Context-aware network with foreground recalibration for grounding natural language in video

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

End-to-end Multi-modal Video Temporal Grounding

Language-Guided Multi-Granularity Context Aggregation for Temporal Sentence Grounding

Localizing Moments in Long Video Via Multimodal Guidance

Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding

Multi-sentence Video Grounding for Long Video Generation

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Proposal-Free Video Grounding with Contextual Pyramid Network

Visual Relation Grounding in Videos

Rethinking Video Sentence Grounding from a Tracking Perspective with Memory Network and Masked Attention