Wolf: Captioning Everything with a World Summarization Framework

Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone

2024-07-27

Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html

Machine Learning, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses several challenges in video captioning and proposes the WOrLd summarization Framework (Wolf) to improve video understanding, automatic annotation, and caption generation. The main issues it identifies are:

1. **Scarcity of high-quality annotated data**: Video captions on the internet often contain errors or do not match the video content, and large-scale manual annotation is very expensive.
2. **Higher difficulty of video captioning compared to image captioning**: Video captioning must handle temporal correlations and the added complexity of camera movement, and existing captioning models perform poorly at temporal reasoning and scene understanding.
3. **Lack of standard benchmarks for evaluating video caption quality**: Existing video question-answering benchmarks are usually limited to short answers, which makes it difficult to evaluate hallucinations in long, detailed captions.
4. **Critical importance of caption accuracy and completeness**: In tasks that use text descriptions to assist planning and control, inaccurate or incomplete captions can lead to safety risks.

To address these challenges, the paper makes the following contributions:

1. It designs the first WOrLd summarization Framework (Wolf) for video captioning and introduces CapScore, a metric based on large language models (LLMs), to evaluate caption quality. For example, compared with GPT-4V on challenging driving videos, Wolf improves CapScore by 55.6% in quality and 77.4% in similarity.
2. It constructs the Wolf benchmark and four human-annotated datasets covering autonomous driving, general scenes, and robotics to facilitate comprehensive comparisons.
3. It open-sources the code, data, and leaderboard to promote further work on video understanding, caption generation, and data alignment.

In summary, the work aims to improve video caption quality by combining a multi-model captioning framework, an LLM-based evaluation metric, and a shared benchmark with a public leaderboard.
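The mixture-of-experts idea described in the abstract (several image and video captioning models produce captions, which are then merged into one summary) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `summarize_captions`, `image_expert`, `video_expert`, and `join_summarizer` are hypothetical, the experts are plain functions standing in for VLMs, and the summarizer is a simple string join standing in for an LLM-based summarization step.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases: an expert maps a sequence of frames
# (represented here by string identifiers) to a caption, and a
# summarizer merges several captions into one.
Captioner = Callable[[Sequence[str]], str]
Summarizer = Callable[[List[str]], str]


def summarize_captions(
    frames: Sequence[str],
    experts: Sequence[Captioner],
    summarizer: Summarizer,
) -> str:
    """Collect one caption per expert, then merge them into a single caption."""
    if not experts:
        raise ValueError("at least one captioning expert is required")
    captions = [expert(frames) for expert in experts]
    return summarizer(captions)


# Stand-ins for real models (assumptions for illustration only).
def image_expert(frames: Sequence[str]) -> str:
    # An image-level model would look at individual frames.
    return f"key frame {frames[0]} shows a car at an intersection"


def video_expert(frames: Sequence[str]) -> str:
    # A video-level model would reason over the whole clip.
    return f"across {len(frames)} frames the car turns left"


def join_summarizer(captions: List[str]) -> str:
    # Placeholder for an LLM summarization call.
    return "; ".join(captions)


if __name__ == "__main__":
    result = summarize_captions(
        ["f0", "f1", "f2"], [image_expert, video_expert], join_summarizer
    )
    print(result)
```

Keeping the experts and the summarizer as interchangeable callables reflects the point made in the abstract that image and video models capture different levels of information; in a real system each stand-in would be replaced by a model call.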

The nature of respiratory changes associated with sleep onset.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Discriminative Latent Semantic Graph for Video Captioning

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

GL-RG: Global-Local Representation Granularity for Video Captioning

Video Captioning Using Global-Local Representation

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Exploring the Role of Audio in Video Captioning

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Seeing Bot

Cap2Sum: Learning to Summarize Videos by Generating Captions

CLIP4Caption ++: Multi-CLIP for Video Caption

Delving Deeper into the Decoder for Video Captioning

Personalized Video Summarization by Multimodal Video Understanding

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

End-to-End Video Captioning Based on Multiview Semantic Alignment for Human–Machine Fusion