How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

2024-05-09

Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at:

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to evaluate Video Large Multi-modal Models (Video-LMMs) on complex video understanding and reasoning, and proposes a new benchmark, the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES). Specifically, it focuses on the following aspects:

1. **Limitations of existing benchmarks**: Current Video-LMM benchmarks mainly measure general video comprehension. They neglect the models' reasoning ability over complex real-world videos and their robustness to user prompts given as text queries.
2. **Proposing a new benchmark**: CVRR-ES covers 11 diverse real-world video dimensions to assess Video-LMM performance on complex videos. These dimensions include multiple action recognition, fine-grained action understanding, partial action recognition, and temporal sequence understanding, among others.
3. **Model performance evaluation**: The paper evaluates 9 recent Video-LMMs, both open-source and closed-source, and finds that most of them (especially the open-source ones) struggle with reasoning and robustness on complex videos.
4. **Improvement techniques**: Based on this analysis, the paper develops a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs.

Through this work, the paper provides insights for building the next generation of human-centric AI systems with stronger reasoning abilities and robustness.
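
The summary above names DSCP as a training-free, two-step prompting technique but does not spell out the steps. The following is only a minimal sketch of what a dual-step prompting wrapper could look like, under stated assumptions: the `VideoLMM` callable interface, the function name `dual_step_answer`, and both prompt texts are hypothetical and are not taken from the paper. It assumes the first step asks the model for a grounded description of the video and the second step answers the user question conditioned on that description.

```python
from typing import Callable

# Hypothetical interface (an assumption, not from the paper):
# any callable mapping (video reference, text prompt) to a text reply.
VideoLMM = Callable[[str, str], str]

# Illustrative instruction text only; not the paper's actual prompt.
REASONING_INSTRUCTION = (
    "Describe what actually happens in the video step by step. "
    "Note the order of actions, and do not assume that anything "
    "mentioned in a question is really present."
)


def dual_step_answer(model: VideoLMM, video: str, question: str) -> str:
    """Query the model twice: first for context, then for the answer."""
    # Step 1: elicit a description of the video, independent of the question.
    context = model(video, REASONING_INSTRUCTION)

    # Step 2: answer the user question conditioned on that description.
    prompt = (
        f"Context from the video: {context}\n"
        f"Question: {question}\n"
        "Answer using only the context and the video. "
        "If the question assumes something that did not happen, say so."
    )
    return model(video, prompt)
```

Because the wrapper only changes the prompts, it needs no fine-tuning and can be placed in front of any model that fits the assumed callable interface.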
