Abstract: Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multimodal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To study this, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that 'directly' relate to the visual content, leading to a bias in MLLMs' responses to other, indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We further observe that attention scores between questions and visual tokens, as well as the model's confidence in its answers, are lower in response to misleading questions than to straightforward ones. To address the first challenge, we introduce a paired positive and negative data construction pipeline to diversify the dataset. For the second challenge, we propose to enhance the model's focus on visual content during decoding by refining the text and visual prompts. For the text prompt, we propose a content-guided refinement strategy that performs preliminary visual content analysis to generate structured information before answering the question. Additionally, we employ a visual attention refinement strategy that highlights question-relevant visual tokens to increase the model's attention to visual content that aligns with the question. Extensive experiments demonstrate that these challenges can be significantly mitigated with our proposed dataset and techniques.
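The second issue in the abstract compares how much attention goes to visual tokens versus system and question tokens. The following is a minimal sketch of one way such a comparison could be computed, not the authors' analysis code. The function name `attention_share`, the token span layout, and the toy attention values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def attention_share(
    attn_row: List[float],
    spans: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Return the fraction of attention mass that falls in each token group.

    attn_row: attention weights from one query position over all context
              positions (for example, one row of a head-averaged attention map).
    spans:    group name -> (start, end) half-open index range in attn_row.
    """
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    return {
        name: sum(attn_row[start:end]) / total
        for name, (start, end) in spans.items()
    }


# Hypothetical layout: 2 system tokens, 4 visual tokens, 2 question tokens.
# These weights are invented for illustration and are not measured values.
attn_row = [0.30, 0.20, 0.05, 0.05, 0.05, 0.05, 0.20, 0.10]
spans = {"system": (0, 2), "visual": (2, 6), "question": (6, 8)}

shares = attention_share(attn_row, spans)
for group, share in shares.items():
    print(f"{group}: {share:.2f}")
```

In this toy row the visual group receives the smallest share even though it has the most tokens, which is the kind of imbalance the abstract describes. A real analysis would presumably aggregate over layers, heads, and generated tokens; how the paper does this is not stated in the abstract.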

Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Rethinking VLMs and LLMs for Image Classification

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

Revisiting Multi-Modal LLM Evaluation

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

InfMLLM: A Unified Framework for Visual-Language Tasks

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

A Survey on Evaluation of Multimodal Large Language Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

A Survey on Benchmarks of Multimodal Large Language Models

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models