Abstract: Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, results in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@T metric. Based on the GenCeption method, we establish the MMECeption benchmark for evaluating Vision LLMs (VLLMs), and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still lag significantly behind human performance and struggle especially with text-intensive tasks.

What problem does this paper attempt to address?

This paper addresses deficiencies in how multimodal large language models (MLLMs) are evaluated. Specifically, existing MLLM evaluation methods face the following challenges:

1. **High cost**: Current evaluation methods usually rely on high-quality annotated multimodal datasets, which are expensive to build and struggle to keep pace with the rapidly evolving capabilities of MLLMs.
2. **Fast benchmark saturation**: As model capabilities improve, existing benchmarks tend to saturate quickly, making it difficult to distinguish performance differences between models.
3. **Risk of training data contamination**: The evaluation dataset may overlap with the model's training data, leading to unreliable evaluation results.
4. **Irrelevant modal content**: In some benchmarks, the non-text modality is often unnecessary because the answer can be inferred from the question alone or from the model's pre-trained knowledge.

To address these challenges, the paper proposes **GenCeption**, an evaluation method that does not require annotated data. GenCeption uses unimodal data to evaluate how well MLLMs maintain semantic consistency across modalities, and it inversely assesses their tendency to hallucinate. The main features of this method include:

- **Low cost**: Using easily accessible unimodal datasets reduces the evaluation cost.
- **Reduced training data contamination**: Lowering the chance that evaluation data overlaps with training data makes the results more trustworthy.
- **Harder-to-saturate benchmark**: The use of complex initial samples makes the benchmark more difficult to saturate.
- **Continuous evaluation metric**: The continuous **GC@T** metric provides more fine-grained evaluation results than discrete metrics.

The paper also introduces **MMECeption**, a benchmark built on the GenCeption method for evaluating Vision LLMs (VLLMs), and compares VLLM performance with that of human annotators. The experimental results support GenCeption's effectiveness and show a strong correlation with established VLLM benchmarks.
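
The iterative procedure described in the abstract (start from a non-textual sample, then alternate description and generation, and measure how far the result drifts from the original) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `describe`, `generate`, and `embed` callables are hypothetical placeholders for an MLLM, a generator for the non-text modality, and an encoder, and the function names are invented here. The exact definition of GC@T is not given in this text, so the linearly increasing weights used in `weighted_drift_score` are an assumption chosen only to show one way a continuous score over T iterations could be formed.

```python
from math import sqrt
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")  # a non-textual sample, e.g. an image


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def genception_similarities(
    seed: Sample,
    describe: Callable[[Sample], str],             # hypothetical: MLLM under test
    generate: Callable[[str], Sample],             # hypothetical: text-to-sample generator
    embed: Callable[[Sample], Sequence[float]],    # hypothetical: sample encoder
    steps: int,
) -> List[float]:
    """Run `steps` describe-then-generate rounds and return, for each round,
    the similarity between the regenerated sample and the original seed."""
    reference = embed(seed)
    current = seed
    similarities: List[float] = []
    for _ in range(steps):
        description = describe(current)   # sample -> text
        current = generate(description)   # text -> new sample
        similarities.append(cosine(reference, embed(current)))
    return similarities


def weighted_drift_score(similarities: Sequence[float]) -> float:
    """Collapse per-round similarities into one continuous score.

    Assumption: round t gets weight t, so later rounds (where drift has had
    more time to accumulate) count more. The source text does not specify
    the weighting used by GC@T.
    """
    if not similarities:
        raise ValueError("at least one iteration is required")
    weights = range(1, len(similarities) + 1)
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
```

With real components, `describe` would wrap the model being evaluated, so a model that hallucinates or omits details in its descriptions would produce regenerated samples that drift further from the seed and therefore a lower score. No annotations are needed at any point, because the only reference is the seed sample itself.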

GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data
