Abstract: Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.
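
The semantics-guided refinement phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `refine_pseudo_label`, the thresholds `keep_thresh` and `rel_ratio`, and the assumption that per-frame query similarities are already computed by some vision-language encoder are all hypothetical choices made for illustration. The sketch drops a pair when its average similarity inside the pseudo segment is low, and otherwise regrows the segment around its best-matching frame.

```python
def refine_pseudo_label(frame_sims, start, end, keep_thresh=0.25, rel_ratio=0.8):
    """Filter one pseudo-labelled pair and adjust its boundaries (illustrative only).

    frame_sims: query-to-frame similarity scores, one per sampled frame
                (assumed to be precomputed; the encoder is not specified here).
    start, end: inclusive frame indices of the pseudo-label segment.
    Returns None if the pair is dropped, otherwise (new_start, new_end).
    """
    segment = frame_sims[start:end + 1]
    # Cleaning step: a weak average match is treated as a mismatched pair.
    if sum(segment) / len(segment) < keep_thresh:
        return None

    # Anchor on the best-matching frame inside the original segment.
    peak = max(range(start, end + 1), key=lambda i: frame_sims[i])
    cutoff = rel_ratio * frame_sims[peak]

    # Initial boundary adjustment: extend while neighbouring frames stay relevant.
    new_start = peak
    while new_start > 0 and frame_sims[new_start - 1] >= cutoff:
        new_start -= 1
    new_end = peak
    while new_end < len(frame_sims) - 1 and frame_sims[new_end + 1] >= cutoff:
        new_end += 1
    return new_start, new_end


sims = [0.1, 0.1, 0.5, 0.6, 0.55, 0.1, 0.1]
print(refine_pseudo_label(sims, start=1, end=4))       # (2, 4)
print(refine_pseudo_label([0.1] * 7, start=1, end=4))  # None
```

In the first call the pseudo segment starts one frame too early, and the sketch tightens it to the frames that actually match the query; in the second call no frame matches, so the pair is discarded.
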

What problem does this paper attempt to address?

This paper addresses the heavy reliance on manually annotated data in Video Moment Retrieval (VMR). The goal of VMR is to localize, in an untrimmed video, the temporal segment described by a natural-language query. Most existing methods are trained on manually annotated data, which is costly, time-consuming, and difficult to scale. These annotations also tend to carry language and temporal biases, which limit their effectiveness in practical applications.

To address this, the paper proposes a new paradigm: pre-training on large-scale unannotated real-world videos to reduce annotation costs. To this end, the authors introduce Vid-Morp, a large-scale dataset containing more than 50,000 untrimmed real-world videos and 200,000 pseudo-annotated samples. Pre-training directly on these imperfect pseudo-annotations raises significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To handle them, the paper proposes the ReCorrect algorithm, which has two main stages:

1. **Semantics-guided refinement**: Uses the semantic similarity between video frames and pseudo-labels to clean out incorrect training samples (such as idle videos and mismatched video-query pairs) and to make an initial adjustment to the temporal boundaries.
2. **Memory-consensus correction**: Uses a memory bank to track model predictions and progressively correct the temporal boundaries based on the consensus within the memory.

With this approach, ReCorrect generalizes well across several downstream settings, including zero-shot inference, unsupervised learning, and fully supervised learning.
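
The memory-consensus correction stage described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's actual procedure: the class name `MemoryBank`, the use of the median as the consensus, the agreement test `max_spread`, and the blending factor `momentum` are all hypothetical. Boundaries are assumed to be normalized to the range 0 to 1.

```python
from collections import defaultdict, deque
from statistics import median


class MemoryBank:
    """Keeps recent boundary predictions per sample (illustrative sketch)."""

    def __init__(self, size=5, momentum=0.5, max_spread=0.1):
        self.size = size              # predictions remembered per sample
        self.momentum = momentum      # weight kept on the current label
        self.max_spread = max_spread  # largest disagreement still counted as consensus
        self.history = defaultdict(lambda: deque(maxlen=size))

    def record(self, sample_id, pred):
        """Store one (start, end) prediction, e.g. after each training epoch."""
        self.history[sample_id].append(pred)

    def correct(self, sample_id, label):
        """Return a corrected (start, end) label, or the label unchanged."""
        preds = self.history[sample_id]
        if len(preds) < self.size:
            return label  # not enough evidence yet
        starts = [p[0] for p in preds]
        ends = [p[1] for p in preds]
        # No consensus if the remembered predictions disagree too much.
        if (max(starts) - min(starts) > self.max_spread
                or max(ends) - min(ends) > self.max_spread):
            return label
        # Move the label part of the way toward the consensus boundaries.
        m = self.momentum
        return (m * label[0] + (1 - m) * median(starts),
                m * label[1] + (1 - m) * median(ends))


bank = MemoryBank(size=3)
label = (0.20, 0.60)
for pred in [(0.30, 0.70), (0.32, 0.72), (0.31, 0.71)]:
    bank.record("video_0", pred)
    label = bank.correct("video_0", label)
print(label)
```

Blending the label toward the consensus, rather than overwriting it, keeps each correction gradual, which matches the progressive nature of the stage; the exact update rule used by ReCorrect may differ.
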
Experimental results show that zero-shot ReCorrect exceeds 75% and 80% of the best fully supervised performance on the two benchmarks, Charades-STA and ActivityNet Captions, while unsupervised ReCorrect reaches approximately 85% on both.

### Summary

The main contributions of this paper are:

1. Introducing the Vid-Morp dataset, which contains more than 50,000 untrimmed real-world videos and 200,000 pseudo-annotated samples.
2. Proposing the ReCorrect algorithm, which handles errors in pseudo-annotations through semantics-guided refinement and memory-consensus correction.
3. Experimentally verifying the strong performance of ReCorrect in zero-shot, unsupervised, and fully supervised settings, showing its potential to reduce the dependence on manual annotations in VMR.

Together, these contributions offer an effective way to reduce the reliance on manual annotation in video moment retrieval.

Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Video Moment Retrieval with Noisy Labels

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

Prompt-based Zero-shot Video Moment Retrieval

Transferable Video Moment Localization by Moment-Guided Query Prompting

Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement

Partial Annotation-based Video Moment Retrieval Via Iterative Learning

Unsupervised Video Moment Retrieval with Knowledge-based Pseudo Supervision Construction

Video Editing for Video Retrieval

Temporal Perceiving Video-Language Pre-training

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Weakly-Supervised Video Moment Retrieval Via Semantic Completion Network

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels

Video Moment Retrieval from Text Queries via Single Frame Annotation

Number it: Temporal Grounding Videos like Flipping Manga

Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair

Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism