Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
2024-12-01
Abstract: Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the heavy reliance on manually annotated data in Video Moment Retrieval (VMR). The goal of VMR is to localize, within an untrimmed video, the temporal segment described by a natural-language query. Most existing methods are trained on manually annotated data, which is costly and time-consuming to produce and difficult to scale. These annotations also tend to carry language and temporal biases, which limit their effectiveness in practical applications.

To address these problems, the paper proposes a new paradigm: pretraining on large-scale, unlabeled, real-world videos to reduce annotation costs. To this end, the authors introduce Vid-Morp, a large-scale dataset containing more than 50,000 videos captured in the wild and 200,000 pseudo annotations. Pretraining directly on these imperfect pseudo annotations poses significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. The paper therefore proposes the ReCorrect algorithm, which has two main stages:

1. **Semantics-guided refinement**: Uses the semantic similarity between video frames and pseudo labels to clean out incorrect training samples (such as idle videos and mismatched video-query pairs) and to make initial adjustments to the temporal boundaries.
2. **Memory-consensus correction**: Uses a memory bank to track model predictions and progressively corrects the temporal boundaries based on the consensus within the memory.

With these two stages, ReCorrect shows strong generalization across a variety of downstream settings, including zero-shot inference, unsupervised learning, and fully supervised learning.
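
To make the first stage more concrete, the following is a minimal sketch of what semantics-guided refinement could look like, not the authors' implementation. It assumes that a per-frame similarity score between the query sentence and each video frame has already been computed by some vision-language encoder. The function name `refine_pseudo_label`, both thresholds, and the grow-from-peak rule are hypothetical choices made for illustration only.

```python
from typing import List, Optional, Tuple


def refine_pseudo_label(
    frame_sims: List[float],
    keep_threshold: float = 0.3,
    rel_threshold: float = 0.7,
) -> Optional[Tuple[int, int]]:
    """Filter one pseudo-annotated pair and propose an initial span.

    frame_sims[i] is an assumed, precomputed similarity between the query
    sentence and frame i. Returns None when the pair looks unpaired,
    otherwise an inclusive (start_frame, end_frame) span.
    """
    if not frame_sims:
        return None

    # Index of the frame that best matches the sentence.
    peak = max(range(len(frame_sims)), key=lambda i: frame_sims[i])

    # Cleaning step: if even the best frame matches poorly, drop the pair.
    if frame_sims[peak] < keep_threshold:
        return None

    # Boundary step: grow outward from the peak while neighbouring frames
    # stay within a fraction (rel_threshold) of the peak similarity.
    cutoff = rel_threshold * frame_sims[peak]
    start = end = peak
    while start > 0 and frame_sims[start - 1] >= cutoff:
        start -= 1
    while end < len(frame_sims) - 1 and frame_sims[end + 1] >= cutoff:
        end += 1
    return start, end
```

For example, with similarities `[0.1, 0.2, 0.6, 0.8, 0.7, 0.2, 0.1]` the sketch keeps the pair and proposes frames 2 through 4 as the initial span, while a sequence whose best score is below `keep_threshold` is discarded as unpaired.
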

Experimental results show that zero-shot ReCorrect achieves over 75% and 80% of the best fully supervised performance on the two benchmarks (Charades-STA and ActivityNet Captions), while unsupervised ReCorrect reaches about 85% on both.

### Summary

The main contributions of this paper are:

1. Introducing the Vid-Morp dataset, which contains more than 50,000 videos captured in the wild and 200,000 pseudo annotations.
2. Proposing the ReCorrect algorithm, which handles errors in pseudo annotations through semantics-guided refinement and memory-consensus correction.
3. Experimentally verifying the strong performance of ReCorrect in various settings, showing its potential to reduce the dependence on manual annotations in the VMR task.

Together, these contributions offer an effective way to reduce reliance on manual annotation in video moment retrieval.
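
The memory-consensus correction named in the second contribution can be illustrated with a small sketch. The text above does not say how consensus is computed, so everything here is an assumption: the `MemoryBank` class, the choice of "the stored prediction with the highest mean temporal IoU against the others" as the consensus, and the `momentum` blend in `correct_label` are all hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    hull = max(a[1], b[1]) - min(a[0], b[0])  # equals the union when spans overlap
    return inter / hull if hull > 0 else 0.0


class MemoryBank:
    """Keeps the most recent `size` predicted spans for each training sample."""

    def __init__(self, size: int = 5) -> None:
        self._bank: Dict[str, Deque[Span]] = defaultdict(lambda: deque(maxlen=size))

    def update(self, sample_id: str, prediction: Span) -> None:
        self._bank[sample_id].append(prediction)

    def history(self, sample_id: str) -> list:
        return list(self._bank.get(sample_id, ()))

    def consensus(self, sample_id: str) -> Optional[Span]:
        """Assumed rule: the stored span that agrees most with the others."""
        preds = self.history(sample_id)
        if len(preds) < 2:
            return None  # not enough history to speak of a consensus

        def agreement(i: int) -> float:
            others = [temporal_iou(preds[i], p) for j, p in enumerate(preds) if j != i]
            return sum(others) / len(others)

        return preds[max(range(len(preds)), key=agreement)]


def correct_label(label: Span, bank: MemoryBank, sample_id: str,
                  momentum: float = 0.5) -> Span:
    """Move the current pseudo label part of the way toward the consensus."""
    target = bank.consensus(sample_id)
    if target is None:
        return label
    return (momentum * label[0] + (1 - momentum) * target[0],
            momentum * label[1] + (1 - momentum) * target[1])
```

In a training loop, one would call `bank.update(sample_id, predicted_span)` after each pass over a sample and periodically replace its pseudo label with `correct_label(...)`. Blending with a momentum term rather than overwriting the label is one way to make the correction progressive, so that a single outlier prediction cannot move the boundaries abruptly.
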