How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan
2024-05-09
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect to assess both the models' reasoning capabilities over complex real-world videos and their robustness to user prompts given as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of evaluating how well large multimodal video models (Video-LMMs) understand and reason over complex videos, and proposes a new benchmark suite, the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES). Specifically, the paper focuses on the following aspects:

1. **Limitations of existing benchmarks**: Current Video-LMM benchmarks mainly focus on general video understanding capabilities, neglecting the reasoning ability and robustness of these models when handling complex videos in real-world scenarios.
2. **Proposing a new benchmark**: CVRR-ES covers 11 diverse video dimensions to comprehensively evaluate the performance of Video-LMMs on complex videos. These dimensions include multiple action recognition, fine-grained action understanding, partial action recognition, and temporal sequence understanding, among others.
3. **Model performance evaluation**: The paper evaluates 9 recent Video-LMMs and finds that most models (especially open-source models) lack reasoning ability and robustness when dealing with complex videos.
4. **Improvement techniques**: Based on the above analysis, the paper develops a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs.

Through this work, the paper provides insights for building the next generation of AI systems with stronger reasoning abilities and robustness.
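
The text above names DSCP only as a training-free, dual-step prompting technique and does not spell out the steps. As a rough illustration of what such a two-step scheme could look like, the Python sketch below first asks the model for a grounded description of the video and then answers the user question conditioned on that description. Everything here is an assumption made for illustration: the `VideoLMM` callable interface, the function name `dual_step_answer`, and the prompt wording are hypothetical and are not taken from the paper.

```python
from typing import Callable

# Hypothetical interface (assumption): any callable that takes
# (video_path, prompt) and returns the model's text response.
VideoLMM = Callable[[str, str], str]

# Illustrative wording only; these are not the prompts used in the paper.
CONTEXT_PROMPT = (
    "Describe what actually happens in the video, step by step. "
    "Mention only actions and objects that are visible."
)


def dual_step_answer(model: VideoLMM, video_path: str, question: str) -> str:
    """Answer a question about a video using two prompting passes."""
    # Step 1: ask the model for a grounded description of the video.
    context = model(video_path, CONTEXT_PROMPT)

    # Step 2: answer the user question, conditioned on that description.
    answer_prompt = (
        f"Context from the video: {context}\n"
        f"Question: {question}\n"
        "Answer using the context and the video. "
        "If the question assumes something that does not happen, say so."
    )
    return model(video_path, answer_prompt)
```

Because the sketch only wraps calls to an existing model, it needs no training or weight updates, which is the property the summary attributes to DSCP. Any real Video-LMM would be plugged in by wrapping its inference call in a function with the assumed `(video_path, prompt)` signature.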