BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot
2023-12-06
Abstract: Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) an LMM that outperforms another model on the common style is not guaranteed to outperform it on other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) an intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to address is the evaluation of large multimodal models (LMMs) in handling images of different styles, particularly their robustness and adaptability when faced with style variations. Specifically, the paper proposes a new benchmark, BenchLMM, to systematically evaluate the performance of LMMs under the following three different styles:

1. **Artistic Styles**: Including images of different artistic styles such as cartoons, paintings, sketches, etc.
2. **Sensor Styles**: Including images obtained from different imaging sensors such as infrared, X-ray, MRI, CT, etc.
3. **Application Styles**: Including images from specific application scenarios such as remote sensing, autonomous driving, robotic action prediction, defect detection, etc.

The main findings of the paper include:

1. **Performance Degradation**: LMMs generally show a decline in performance when handling images of uncommon styles.
2. **Style Inconsistency**: Good performance of an LMM on common styles does not necessarily imply excellent performance on other styles.
3. **Style Prompt Enhancement**: Prompting the LMM to recognize the image style before answering questions can significantly improve its reasoning ability.
4. **Error Reflection Ability**: More intelligent LMMs can explain the reasons for their errors when faced with style variations and learn the correct answers from these mistakes.

The paper hopes that these findings will provide new perspectives and methods for developing more intelligent and versatile LMMs.
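
The style-prompting finding (predict the style first, then answer) can be illustrated with a minimal two-turn sketch. This is not the paper's implementation: the `LMM` callable interface, the prompt wording, and the `fake_lmm` stand-in are all assumptions made for illustration, since this summary does not give the exact prompts or model API used.

```python
from typing import Callable, Dict

# Hypothetical interface (an assumption, not a real library API):
# takes an image reference and a text prompt, returns the model's text reply.
LMM = Callable[[str, str], str]

# Assumed prompt wording; the paper's actual prompts may differ.
STYLE_PROMPT = "What is the style of this image? Answer in a few words."


def answer_with_style_hint(model: LMM, image: str, question: str) -> Dict[str, str]:
    """Ask the model for the image style first, then ask the real
    question with the predicted style included in the prompt."""
    # Turn 1: let the model predict the style of the image.
    style = model(image, STYLE_PROMPT).strip()

    # Turn 2: condition the actual question on the predicted style.
    prompt = (
        f"The style of this image is: {style}. "
        f"Considering that style, answer the question: {question}"
    )
    answer = model(image, prompt).strip()
    return {"style": style, "prompt": prompt, "answer": answer}


def fake_lmm(image: str, prompt: str) -> str:
    """Stand-in model for demonstration only; swap in a real LMM call."""
    if prompt == STYLE_PROMPT:
        return "infrared image"
    return "two people"


if __name__ == "__main__":
    result = answer_with_style_hint(
        fake_lmm, "scene.png", "How many people are visible?"
    )
    print(result["prompt"])
    print(result["answer"])
```

Because the approach only changes the prompts, it needs no fine-tuning, which is consistent with the abstract describing the method as training-free. Passing the model in as a plain callable keeps the sketch independent of any particular LMM client.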