Abstract: Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark that requires methods to localize the best-matched moment from a corpus that also contains other partially matched candidates. To improve the efficiency of dataset construction and guarantee high-quality annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline that generates captions with RelIable FInE-grained statics and Dynamics. Specifically, we use large language models (LLMs) and large multimodal models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out inaccurate annotations caused by LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator, in which we fine-tune a video foundation model with contrastive and matching losses augmented by disturbed hard negatives. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, which demonstrate a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed datasets, revealing that there is still significant scope for fine-grained video understanding in VCMR. Code and datasets are available at https://github.com/hlchen23/VERIFIED.

Consumer Video Understanding

Analyzing and Predicting Consumer Response to Short Videos in E-Commerce

FCVID: Fudan-Columbia Video Dataset

Categorizing Big Video Data on the Web: Challenges and Opportunities

LSVC2017

VCDB: A Large-Scale Database for Partial Copy Detection in Videos

Video diver: generic video indexing with diverse features

CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios

Towards Open-Vocabulary Video Instance Segmentation

Robust Semantic Concept Detection in Large Video Collections

Robust Semantic Video Indexing by Harvesting Web Images

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Video Quality Assessment: A Comprehensive Survey

Random-sampling-based spatial-temporal feature for consumer video concept classification

HMDB: A large video database for human motion recognition

Towards Open-Vocabulary Video Semantic Segmentation

VCD: Knowledge Base Guided Visual Commonsense Discovery in Images

Robust Commercial Retrieval in Video Streams

A Study of Actor and Action Semantic Retention in Video Supervoxel Segmentation

WebVision Database: Visual Learning and Understanding from Web Data