StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, Maosong Sun
2024-11-06
Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the shortcomings of multimodal large language models (MLLMs) in streaming video understanding. Existing MLLMs perform well on offline video understanding tasks, but most of them must process all video frames before any query can be made. They cannot watch, listen, think, and respond to streaming input in real time as humans do, which leaves a clear gap in real-time streaming video understanding.

To evaluate and advance MLLMs in this setting, the authors propose a benchmark called **StreamingBench**. StreamingBench comprises 18 tasks, 900 videos, and 4,500 human-curated question-answer pairs. Each video has five questions posed at different time points to simulate a continuous streaming scenario. The tasks assess three core aspects:

1. **Real-time Visual Understanding**: Evaluates the model's ability to recognize and interpret objects, actions, and changes in real-time video streams.
2. **Omni-source Understanding**: Evaluates the model's ability to process visual and audio information simultaneously in real-time video streams.
3. **Contextual Understanding**: Evaluates the model's ability to understand broader context in complex streaming video environments, including detecting anomalies, filtering misleading information, maintaining continuous interaction, and proactively producing output based on predefined conditions.

In experiments with 13 open-source and proprietary MLLMs on StreamingBench, the authors found that even the most advanced proprietary models (such as Gemini 1.5 Pro and GPT-4o) perform significantly below human level in streaming video understanding, leaving substantial room for improvement.
The authors hope that their work will promote the future development of MLLMs, enabling them to approach human-level video understanding and interaction capabilities in more realistic scenarios.
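
The setup of asking five questions per video at different time points can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The names `StreamingQuestion`, `visible_frames`, and `accuracy` are hypothetical, and the rule that a model may only see frames up to the moment a question is asked is an assumption about how such a streaming scenario could be simulated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StreamingQuestion:
    """Hypothetical record for one QA pair tied to a point in the stream."""
    video_id: str
    ask_time_s: float  # moment in the video at which the question is posed
    question: str
    answer: str


def visible_frames(frame_times_s: List[float], ask_time_s: float) -> List[float]:
    """Keep only frames at or before the query time (assumed streaming rule)."""
    return [t for t in frame_times_s if t <= ask_time_s]


def accuracy(
    questions: List[StreamingQuestion],
    frame_times_by_video: Dict[str, List[float]],
    answer_fn: Callable[[List[float], str], str],
) -> float:
    """Fraction of questions answered correctly from the visible prefix only."""
    correct = 0
    for q in questions:
        frames = visible_frames(frame_times_by_video[q.video_id], q.ask_time_s)
        if answer_fn(frames, q.question) == q.answer:
            correct += 1
    return correct / len(questions)


# Dataset size stated in the abstract: 900 videos with five questions each.
assert 900 * 5 == 4500
```

Here `answer_fn` stands in for any model wrapper. An offline model would instead receive every frame regardless of `ask_time_s`, which is the contrast the benchmark is designed to expose.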