A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran

2023-12-01

Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
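The R@1 figures quoted in the abstract are recall-at-K scores for text-to-video retrieval. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the function below assumes a square similarity matrix in which caption `i` is paired with video `i`; the name `recall_at_k` and the toy scores are illustrative assumptions.

```python
def recall_at_k(similarity, k=1):
    """Fraction of captions whose paired video ranks in the top k.

    similarity[i][j] is the score between caption i and video j.
    Assumption for this sketch: the correct video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        correct_score = row[i]
        # Rank = number of videos that score strictly higher than the correct one.
        rank = sum(1 for score in row if score > correct_score)
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example: captions 0 and 2 retrieve their own video first; caption 1 does not.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_3 = recall_at_k(toy_similarity, k=3)
```

Under this definition, a reported gain such as +1.1% R@1 means that the paired video is ranked first for a correspondingly larger share of queries.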

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the lack of description diversity in how long video retrieval systems are trained and evaluated. Specifically:

1. **Limitations of Existing Systems**:
   - Existing long video retrieval systems typically pair each video with a single long paragraph, which overlooks the many other valid ways a video can be described.
   - In practice, a video can be described in moment-by-moment detail, summarized in a single phrase, or described only in part.

2. **Proposed New Problem Framework**:
   - The paper proposes a new problem framework called "10,000 Words," which aims to build a diverse video description dataset covering captions of different lengths, levels of simplification, and partial descriptions.
   - With this data, the paper evaluates and improves the ability of existing video retrieval models to handle diverse descriptions.

3. **Specific Methods and Contributions**:
   - The paper creates three datasets (ActivityNet10k, QuerYD10k, and LF-VILA10k) that enrich existing video descriptions through a flexible data generation process.
   - It uses state-of-the-art large language models (LLMs) to generate the diverse descriptions and validates their quality through human inspection.
   - It benchmarks existing models on these datasets and finds that they struggle with short and partial descriptions.
   - It proposes a lightweight fine-tuning method that uses a contrastive loss to improve the model's handling of diverse descriptions. On ActivityNet, this yields +1.1% R@1 on standard paragraph-to-video retrieval and +3.6% R@1 for short descriptions.

In summary, the paper aims to improve the robustness and generalization of video retrieval systems by introducing diverse video description datasets and a fine-tuning method that exploits them.
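To make the contrastive fine-tuning idea concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE) loss over a caption-video similarity matrix. This is not the paper's hierarchical loss, whose details are not given here. The function names, the temperature value, and the idea of averaging one loss per caption level are all assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def info_nce(similarity, temperature=0.07):
    """Symmetric InfoNCE loss for a square caption-video similarity matrix.

    similarity[i][j] scores caption i against video j; pair (i, i) is the
    positive and every other pair in the batch is treated as a negative.
    The temperature of 0.07 is a common default, assumed for this sketch.
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    # Text-to-video direction: softmax over the videos for each caption.
    t2v = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Video-to-text direction: softmax over the captions for each video.
    columns = [list(col) for col in zip(*scaled)]
    v2t = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (t2v + v2t)


def multi_caption_loss(similarities_by_level):
    """Hypothetical aggregation: average the loss over caption levels.

    similarities_by_level maps a caption level (for example "paragraph" or
    "short summary") to its own caption-video similarity matrix.
    """
    losses = [info_nce(sim) for sim in similarities_by_level.values()]
    return sum(losses) / len(losses)


aligned = [[1.0, 0.0], [0.0, 1.0]]     # each caption matches its own video
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each caption matches the wrong video
loss_good = info_nce(aligned)
loss_bad = info_nce(misaligned)
```

The loss is close to zero when each caption scores highest against its own video and grows when it prefers another video. Training on several caption levels at once pushes the short and partial descriptions toward the same video embedding as the full paragraph.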

