Abstract: The rising popularity of multimodal large language models (MLLMs) has driven a surge of research on evaluating them. However, current evaluation studies concentrate predominantly on how well models comprehend and reason within a unimodal (vision-only) context, and overlook performance on complex multimodal reasoning tasks that integrate both visual and text contexts. Such tasks are more challenging because they demand reasoning across modalities and a deep understanding of multimodal contexts. In this paper, we introduce MM-InstructEval, a comprehensive assessment framework that combines a diverse set of metrics to evaluate various models and instructions across a broad range of multimodal reasoning tasks with vision-text contexts. MM-InstructEval advances research on the performance of MLLMs in complex multimodal reasoning tasks and enables a more thorough, holistic zero-shot evaluation of MLLMs. We first use the "Best Performance" metric to determine the upper performance limit of each model on each dataset. The "Mean Relative Gain" metric summarizes the overall performance of different models and instructions, while the "Stability" metric measures their sensitivity to variations. Prior research has evaluated models or instructions in isolation, overlooking the interplay between the two. To address this gap, we introduce the "Adaptability" metric, which quantifies how well models and instructions suit each other. Evaluations are conducted on 31 models (23 MLLMs) across 16 multimodal datasets, covering 6 tasks, with 10 distinct instructions. This extensive analysis enables us to derive novel insights.
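The abstract names the four metrics but does not give their formulas, so the following is only a minimal sketch under assumed definitions: Best Performance as the maximum score over instructions, Mean Relative Gain as the average percentage gain over the cross-model mean, Stability as the standard deviation over instructions, and Adaptability as the fraction of datasets on which an instruction ranks in a model's top k. The `results` layout, the function names, and all scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

# Hypothetical scores: results[dataset][model][instruction] -> accuracy (%).
results = {
    "dataset_1": {
        "model_a": {"inst_1": 60.0, "inst_2": 70.0},
        "model_b": {"inst_1": 40.0, "inst_2": 30.0},
    },
    "dataset_2": {
        "model_a": {"inst_1": 80.0, "inst_2": 90.0},
        "model_b": {"inst_1": 20.0, "inst_2": 10.0},
    },
}


def best_performance(results, model, dataset):
    """Assumed definition: the model's highest score over all instructions."""
    return max(results[dataset][model].values())


def mean_relative_gain(results, model, dataset):
    """Assumed definition: average percent gain over the cross-model mean."""
    by_model = results[dataset]
    gains = []
    for inst, score in by_model[model].items():
        avg = mean(by_model[m][inst] for m in by_model)
        gains.append(100.0 * (score - avg) / avg)
    return mean(gains)


def stability(results, model, dataset):
    """Assumed definition: population std dev over instructions (lower is steadier)."""
    return pstdev(list(results[dataset][model].values()))


def adaptability(results, model, instruction, k=1):
    """Assumed definition: share of datasets where the instruction is in the model's top k."""
    hits = 0
    for by_model in results.values():
        scores = by_model[model]
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += instruction in ranked[:k]
    return hits / len(results)
```

Under these assumptions, a model-instruction pair with adaptability close to 1 is one where that instruction is consistently among the model's best across datasets, which is the kind of model-instruction interplay the abstract says earlier evaluations overlooked.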

What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing

POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models

Visual Prompting in Multimodal Large Language Models: A Survey

Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Helping Language Models Learn More: Multi-dimensional Task Prompt for Few-shot Tuning

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Large Language Models are Good Multi-lingual Learners: When LLMs Meet Cross-lingual Prompts

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

Extensible Prompts for Language Models on Zero-shot Language Style Customization

Probing Multimodal Large Language Models for Global and Local Semantic Representations

Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization for Prompt Enhancement

DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models