Abstract: Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve the dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out the inaccurate annotations caused by the LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator where we fine-tune a video foundation model with disturbed hard-negatives augmented contrastive and matching losses. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed dataset, revealing that there is still significant scope for fine-grained video understanding in VCMR. Code and datasets are available at https://github.com/hlchen23/VERIFIED.

What problem does this paper attempt to address?

This paper addresses imprecise video moment localization in existing Video Corpus Moment Retrieval (VCMR) settings, which stems from coarse-grained queries. Specifically:

1. **Imprecise localization caused by coarse-grained queries**:
   - In existing VCMR settings, queries are usually coarse-grained, so a model struggles to single out the best-matching moment, especially when several candidate moments partially match the query.
   - This hinders cross-modal retrieval and makes it hard for the model to learn discriminative video features.
2. **The need for fine-grained VCMR**:
   - The paper proposes a more challenging fine-grained VCMR benchmark that requires a model to localize the best-matching moment in a large corpus of untrimmed videos, given a fine-grained text query.
   - The model must therefore understand the details in the text description and also distinguish the target moment from partially matching candidates.
3. **The problem of high-quality data annotation**:
   - Annotating fine-grained video-text datasets depends on extensive manual work and domain knowledge, which limits productivity and scalability.
   - The paper therefore proposes VERIFIED, an automatic video-text annotation pipeline that uses large language models (LLMs) and large multimodal models (LMMs) to generate fine-grained captions with reliable static and dynamic details.

### Overview of the solution

To address these problems, the paper proposes the following:

1. **VERIFIED automatic annotation pipeline**:
   - **Statics Enhanced Captioning**: Extract key frames and generate captions containing rich static details.
   - **Dynamics Enhanced Captioning**: Guide the model to capture dynamic changes through video question answering (VQA) and generate captions containing dynamic details.
   - **Fine-Granularity Aware Noise Evaluator**: Filter out inaccurate annotations using a video foundation model fine-tuned with contrastive and matching losses augmented by disturbed hard negatives.
2. **A new fine-grained VCMR benchmark**:
   - Starting from widely adopted VCMR datasets (Charades-STA, DiDeMo, ActivityNet Captions), the VERIFIED pipeline is used to construct three fine-grained datasets (Charades-FIG, DiDeMo-FIG, ActivityNet-FIG) with a high level of annotation quality.
3. **Evaluation of existing models**:
   - Several state-of-the-art VCMR models are evaluated on the new fine-grained benchmark. The results show that these models still have substantial room for improvement on fine-grained queries.

Together, these contributions address the annotation bottleneck in fine-grained video understanding and provide a more challenging and practical benchmark for future research.
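To make the idea of a hard-negative augmented contrastive loss more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a standard InfoNCE-style video-to-text loss in which each video's logits are extended with similarities to K perturbed ("disturbed") captions. The function names, the temperature value, and the embedding shapes are illustrative assumptions, and the matching loss mentioned in the abstract is not covered.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def hard_negative_contrastive_loss(video, text, hard_neg_text, temperature=0.07):
    """Video-to-text InfoNCE loss with extra hard negatives (hypothetical helper).

    video:         (B, D) video embeddings
    text:          (B, D) matching caption embeddings
    hard_neg_text: (B, K, D) embeddings of K perturbed captions per video
    """
    v = l2_normalize(video)
    t = l2_normalize(text)
    h = l2_normalize(hard_neg_text)

    in_batch = v @ t.T                     # (B, B): other captions act as negatives
    hard = np.einsum("bd,bkd->bk", v, h)   # (B, K): each video vs. its own hard negatives
    logits = np.concatenate([in_batch, hard], axis=1) / temperature

    log_probs = log_softmax(logits, axis=1)
    idx = np.arange(v.shape[0])
    # The positive for video i is caption i, i.e. the diagonal of the in-batch block.
    return float(-log_probs[idx, idx].mean())
```

Because the hard negatives only add terms to the softmax denominator, the loss can never be lower than the plain in-batch version. This is what pushes the model to separate a correct caption from near-duplicates that differ in a single fine-grained detail.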

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Video Moment Retrieval with Noisy Labels

LVBench: An Extreme Long Video Understanding Benchmark

TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Beyond Coarse-Grained Matching in Video-Text Retrieval

VideoMCC: a New Benchmark for Video Comprehension

Towards Event-oriented Long Video Understanding

MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

A Dataset with Multi-Modal Information and Multi-Granularity Descriptions for Video Captioning

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding