A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran
2023-12-01
Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily addresses the lack of description diversity in how long video retrieval systems are trained and evaluated. Specifically:

1. **Limitations of Existing Systems**:
   - Existing long video retrieval systems typically use a single long paragraph to describe the entire video, which overlooks the many different valid ways a video can be described.
   - In practice, a video can be described in moment-by-moment detail, summarized briefly, or described only in part.
2. **Proposed New Problem Framework**:
   - The paper proposes a new problem framework called "10,000 Words," aimed at generating a diverse video description dataset whose captions vary in length, level of simplification, and how much of the video they cover.
   - Through this approach, the paper evaluates and aims to improve the ability of existing video retrieval models to handle diverse descriptions.
3. **Specific Methods and Contributions**:
   - The paper creates three datasets (ActivityNet10k, QuerYD10k, and LF-VILA10k) that enrich existing video descriptions through a flexible data generation process.
   - It uses state-of-the-art large language models (LLMs) to generate diverse descriptions and verifies the quality of the generated data through rigorous human inspection.
   - Existing models are benchmarked on these datasets, revealing that they struggle with short and partial descriptions.
   - A lightweight fine-tuning method is proposed that uses a contrastive loss to improve the model's handling of diverse descriptions. It also improves performance on the standard paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet).

In summary, the paper aims to improve the robustness and generalization ability of video retrieval systems by introducing diverse video description datasets and a fine-tuning method that exploits them.
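
To make the R@1 metric and the contrastive objective mentioned above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the paper's loss is hierarchical and depends on the differing information levels of the captions, whereas this sketch shows only a generic symmetric InfoNCE-style contrastive loss. The function names `recall_at_k` and `info_nce`, the temperature value, and the assumption that caption `i` matches video `i` are all illustrative choices that do not come from the paper.

```python
import math


def recall_at_k(similarity, k=1):
    """Fraction of captions whose matching video ranks in the top k.

    similarity[i][j] is the score between caption i and video j.
    Assumption for this sketch: the correct video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of videos scoring strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def info_nce(similarity, temperature=0.07):
    """Generic symmetric InfoNCE loss over a square similarity matrix.

    Averages the caption-to-video and video-to-caption cross-entropies,
    treating the diagonal entries as the positive pairs.
    """
    n = len(similarity)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_sum - logits[i]
        return total / n

    # Transpose to score each video against all captions.
    columns = [list(col) for col in zip(*similarity)]
    return 0.5 * (cross_entropy(similarity) + cross_entropy(columns))


# Toy example: 3 captions x 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],  # caption 1 is matched poorly with video 1
    [0.1, 0.4, 0.7],
]
print(recall_at_k(sim, k=1))
print(info_nce(sim))
```

In this toy matrix, two of the three captions rank their own video first, so R@1 is 2/3. Fine-tuning with a contrastive loss of this kind pushes the diagonal scores up relative to the off-diagonal ones, which is what raises recall.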