Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io

What problem does this paper attempt to address?

This paper addresses the limited ability of video understanding models to comprehend fine-grained motion. Specifically:

1. **Limitations of Existing Benchmarks**: Current video understanding benchmarks mainly focus on event-level and story-level understanding and neglect fine-grained motion. As a result, evaluation data for motion dynamics is insufficient in both volume and diversity.
2. **The Tension between High Frame Rate and Computational Cost**: Fine-grained motion understanding requires processing high-frame-rate video, which incurs high computational and memory costs. Existing video understanding models can therefore process only a limited number of frames, which falls short of what fine-grained motion analysis needs.
3. **Limited Fine-grained Motion Understanding Ability**: Even when the frame rate is increased, existing models still perform poorly at motion-level understanding, with accuracy below 60%, indicating that their basic capabilities are limited and that they struggle with complex motion scenarios.

To solve these problems, the paper proposes the following:

- **MotionBench Benchmark**: A benchmark specifically designed to evaluate the fine-grained motion understanding of video understanding models. It contains 8,052 questions from different sources, covering six major motion categories and ensuring a wide representation of real-world video content.
- **Through-Encoder Fusion (TE Fusion) Method**: A new video feature compression architecture that improves video feature representation by applying deep fusion throughout the visual encoder. It performs especially well at high compression ratios.

Through these methods, the paper aims to guide and inspire the development of more powerful video understanding models, with particular emphasis on fine-grained motion understanding.

### Formula Summary

The main formula in the paper defines "Annotation Density", which quantifies the amount of annotation information per second of video:

\[ \text{Annotation Density} = \frac{\text{Total length of questions}}{\text{Video duration}} \]

This measure indicates how much annotation MotionBench carries per second of video, which supports its use for evaluating the fine-grained motion understanding of video understanding models.
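As a rough illustration of the Annotation Density formula, the sketch below computes it for a single clip. The summary does not state the units, so this example assumes that question length is counted in whitespace-separated words and that duration is in seconds. The function name `annotation_density` and the sample questions are hypothetical and are not taken from the paper or its code.

```python
def annotation_density(questions, video_duration_s):
    """Return total question length divided by video duration.

    Assumption: "length" is the number of whitespace-separated words,
    and the duration is given in seconds. The source text does not
    specify either unit.
    """
    if video_duration_s <= 0:
        raise ValueError("video_duration_s must be positive")
    # Sum the word counts of all questions attached to this video.
    total_length = sum(len(q.split()) for q in questions)
    return total_length / video_duration_s


# Hypothetical questions for a 7-second clip (7 words each, 14 in total).
example_questions = [
    "Which hand does the person raise first?",
    "How many times does the ball bounce?",
]
density = annotation_density(example_questions, video_duration_s=7.0)
print(f"{density:.2f} words per second")  # prints "2.00 words per second"
```

Under these assumptions, a higher value means more question text per second of video, which is one way to read a benchmark as more densely annotated.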

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

LVBench: An Extreme Long Video Understanding Benchmark

MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Towards Event-oriented Long Video Understanding

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

MMBench: Is Your Multi-modal Model an All-around Player?

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding