BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot

2023-12-06

Abstract: Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM that outperforms another model on common styles is not guaranteed to outperform it on other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The problem this paper attempts to address is how to evaluate large multimodal models (LMMs) on images of different styles, in particular their robustness and adaptability under style variations. Specifically, the paper proposes a new benchmarking framework, BenchLMM, to systematically evaluate the performance of LMMs under the following three styles:

1. **Artistic Styles**: Images in different artistic styles such as cartoons, paintings, sketches, etc.
2. **Sensor Styles**: Images obtained from different imaging sensors such as infrared, X-ray, MRI, CT, etc.
3. **Application Styles**: Images from specific application scenarios such as remote sensing, autonomous driving, robotic action prediction, defect detection, etc.

The main findings of the paper include:

1. **Performance Degradation**: LMMs generally show a decline in performance when handling images of uncommon styles.
2. **Style Inconsistency**: Good performance of an LMM on common styles does not necessarily imply good performance on other styles.
3. **Style Prompt Enhancement**: Prompting the LMM to recognize the image style before answering questions can improve its reasoning ability, and this requires no additional training.
4. **Error Reflection Ability**: More intelligent LMMs can explain the reasons for their errors when faced with style variations and learn the correct answers from these mistakes.

The paper hopes that these findings will provide new perspectives and methods for developing more intelligent and versatile LMMs.
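The style-prompting finding can be illustrated with a minimal two-step sketch: first ask the model for the image style, then include that style in the prompt for the actual question. The prompt wording, the function names `answer_with_style_prompt` and `fake_lmm`, and the `(image_path, prompt) -> str` interface are all assumptions made for illustration; the paper's exact prompts and model API are not given in this summary.

```python
from typing import Callable, Dict

# Hypothetical interface: any callable that takes an image path and a text
# prompt and returns the model's text reply. Not a real library API.
AskFn = Callable[[str, str], str]


def answer_with_style_prompt(ask: AskFn, image_path: str, question: str) -> Dict[str, str]:
    """Two-step, training-free prompting: predict the style, then answer."""
    # Step 1: ask the model to name the image style (assumed wording).
    style_prompt = "What is the style of this image? Answer in a few words."
    style = ask(image_path, style_prompt).strip()

    # Step 2: ask the real question, conditioning on the predicted style.
    final_prompt = (
        f"This image is in the following style: {style}. "
        f"Taking the style into account, answer the question: {question}"
    )
    answer = ask(image_path, final_prompt).strip()
    return {"style": style, "prompt": final_prompt, "answer": answer}


def fake_lmm(image_path: str, prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("What is the style"):
        return "infrared image"
    return "two people"


result = answer_with_style_prompt(
    fake_lmm, "example.png", "How many people are in the image?"
)
print(result["prompt"])
print(result["answer"])
```

In practice `fake_lmm` would be replaced by a wrapper around whichever LMM is being evaluated; because the approach only changes the prompt, the model weights stay untouched.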

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

A Survey on Benchmarks of Multimodal Large Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Are We on the Right Way for Evaluating Large Vision-Language Models?

VisualCritic: Making LMMs Perceive Visual Quality Like Humans

A Survey on Multimodal Benchmarks: In the Era of Large AI Models