Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.

What problem does this paper attempt to address?

This paper addresses the shortcomings of multimodal large language models (MLLMs) in streaming video understanding. Although existing MLLMs perform well on offline video understanding tasks, most of them must process all video frames before any query can be made, so they cannot watch, listen, think, and respond to streaming input in real time as humans do.

To evaluate and advance MLLMs in this setting, the authors propose a benchmark called **StreamingBench**. It comprises 18 tasks, 900 videos, and 4,500 human-curated question-answer pairs. Each video has five questions posed at different time points to simulate a continuous streaming scenario. The tasks assess three core aspects:

1. **Real-time Visual Understanding**: Evaluates the model's ability to recognize and interpret objects, actions, and changes in real-time video streams.
2. **Omni-source Understanding**: Evaluates the model's ability to process visual and audio information simultaneously in real-time video streams.
3. **Contextual Understanding**: Evaluates the model's ability to understand broader context in complex streaming video environments, including detecting anomalies, filtering misleading information, maintaining continuous interaction, and proactively producing output when predefined conditions are met.

In experiments on 13 open-source and proprietary MLLMs, the authors found that even the most advanced proprietary models (such as Gemini 1.5 Pro and GPT-4o) perform significantly below human level in streaming video understanding, which leaves substantial room for improvement.

The authors hope that their work will promote the future development of MLLMs, enabling them to approach human-level video understanding and interaction capabilities in more realistic scenarios.
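The streaming setup described above, where each question is posed at a specific time point, can be illustrated with a minimal sketch. It assumes that a question asked at time t may only draw on the video up to t. The names `StreamingQuestion`, `StreamingVideo`, `ModelFn`, and `evaluate_streaming` are hypothetical and are not taken from the StreamingBench code, and the sketch does not cover the proactive-output setting, which would need a different protocol.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StreamingQuestion:
    timestamp: float      # seconds into the video at which the question is asked
    question: str
    options: List[str]
    answer: str           # ground-truth option label, e.g. "A"


@dataclass
class StreamingVideo:
    video_id: str
    duration: float       # total length of the video in seconds
    questions: List[StreamingQuestion]


# Hypothetical model interface (an assumption, not a real API):
# (video_id, clip_end_seconds, question, options) -> predicted option label
ModelFn = Callable[[str, float, str, List[str]], str]


def evaluate_streaming(videos: List[StreamingVideo], model_fn: ModelFn) -> float:
    """Return accuracy when each question only sees the video up to its timestamp."""
    correct = 0
    total = 0
    for video in videos:
        # Ask questions in chronological order, as they would arrive in a stream.
        for q in sorted(video.questions, key=lambda item: item.timestamp):
            # Never expose content past the question time or past the end of the video.
            clip_end = min(q.timestamp, video.duration)
            prediction = model_fn(video.video_id, clip_end, q.question, q.options)
            correct += int(prediction == q.answer)
            total += 1
    return correct / total if total else 0.0
```

With five questions per video, 900 videos yield 900 × 5 = 4,500 question-answer pairs, which matches the totals reported in the abstract.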

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
