Wolf: Captioning Everything with a World Summarization Framework

Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
2024-07-27
Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html
Machine Learning, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses several key challenges in video captioning and proposes the WOrLd summarization Framework (Wolf) to improve video understanding, auto-labeling, and caption generation. The main issues it targets are:

1. **Scarcity of high-quality annotated data**: Video captions on the internet often contain errors or do not match the video content, and large-scale manual annotation is very expensive.
2. **Higher difficulty of video captioning compared to image captioning**: Video captioning must handle temporal correlations and the added complexity of camera movement, and existing captioning models perform poorly at temporal reasoning and scene understanding.
3. **Lack of standard benchmarks for evaluating video caption quality**: Existing video question-answering benchmarks are usually limited to short answers, which makes it difficult to evaluate hallucinations in long, detailed captions.
4. **Critical importance of caption accuracy and completeness**: In tasks that use text descriptions to assist planning and control, inaccurate or incomplete captions can lead to safety risks.

To address these challenges, the paper makes the following contributions:

1. It designs the first WOrLd summarization Framework (Wolf) for video captioning and introduces CapScore, a metric based on large language models (LLMs), to evaluate caption quality and similarity to ground-truth captions. Wolf outperforms VILA1.5, CogAgent, Gemini-Pro-1.5, and GPT-4V; on challenging driving videos it improves CapScore over GPT-4V by 55.6% in quality and 77.4% in similarity.
2. It constructs the Wolf benchmark and four human-annotated datasets covering autonomous driving, general scenes, and robotics to facilitate comprehensive comparisons.
3. It open-sources the code, data, and leaderboard to promote further development in video understanding, caption generation, and data alignment.
In summary, the paper aims to improve video caption quality by combining image-level and video-level VLM captions in a single summarization framework, and it supports that goal with an LLM-based metric, human-annotated datasets, and a public leaderboard.
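The abstract describes Wolf as combining captions from image-level and video-level models and then summarizing them. The sketch below illustrates that general pattern only; it is not the authors' implementation. Every name in it (`sample_keyframes`, `build_summary_prompt`, `caption_video`, the `keyframe_step` parameter, and the prompt wording) is a hypothetical placeholder, and the captioners and summarizer are plain callables standing in for whatever VLMs and LLM a real system would call.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins: a real system would pass image data, not strings.
Frame = str
Captioner = Callable[[Sequence[Frame]], str]   # frames -> caption
Summarizer = Callable[[str], str]              # prompt -> merged caption


def sample_keyframes(frames: Sequence[Frame], step: int) -> List[Frame]:
    """Keep every `step`-th frame (assumed sampling rule, for illustration)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames[::step])


def build_summary_prompt(captions: Dict[str, str]) -> str:
    """Assemble one prompt listing every expert caption under its source tag."""
    lines = [
        "Merge the following captions of the same video into one caption.",
        "Keep details the captions agree on and drop contradictions.",
        "",
    ]
    for name in sorted(captions):
        lines.append(f"[{name}] {captions[name]}")
    return "\n".join(lines)


def caption_video(
    frames: Sequence[Frame],
    image_captioners: Dict[str, Captioner],
    video_captioners: Dict[str, Captioner],
    summarizer: Summarizer,
    keyframe_step: int = 10,
) -> str:
    """Collect captions from every expert, then ask the summarizer to merge them."""
    keyframes = sample_keyframes(frames, keyframe_step)
    captions: Dict[str, str] = {}
    # Image-level experts only see the sampled keyframes.
    for name, model in image_captioners.items():
        captions[f"image:{name}"] = model(keyframes)
    # Video-level experts see the full frame sequence.
    for name, model in video_captioners.items():
        captions[f"video:{name}"] = model(frames)
    if not captions:
        raise ValueError("at least one captioner is required")
    return summarizer(build_summary_prompt(captions))
```

Keeping the experts and the summarizer as injected callables means the merging logic can be exercised with simple stubs before any model is attached, and experts can be added or removed without touching the pipeline.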