GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data

Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
2024-07-23
Abstract: Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, results in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@T metric. Based on the GenCeption method, we establish the MMECeption benchmark for evaluating Vision LLMs (VLLMs), and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still lag significantly behind human performance and struggle especially with text-intensive tasks.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses shortcomings in how multimodal large language models (MLLMs) are evaluated. Existing evaluation methods face four challenges:

1. **High cost**: Current evaluation methods usually rely on high-quality annotated multimodal datasets, which are expensive to build and struggle to keep pace with the rapidly evolving capabilities of MLLMs.
2. **Fast benchmark saturation**: As model capabilities improve, existing benchmarks saturate quickly, making it difficult to distinguish between models.
3. **Risk of training data contamination**: The evaluation dataset may overlap with the model's training data, which makes the evaluation results unreliable.
4. **Irrelevant modal content**: In some benchmarks, the non-text modality is often unnecessary because the answer can be inferred from the question alone or from the model's pre-trained knowledge.

To address these challenges, the paper proposes **GenCeption**, an evaluation method that does not require annotated data. GenCeption uses unimodal data to evaluate how well an MLLM maintains semantic consistency across modalities, and it inversely measures the model's tendency to hallucinate. The main features of this method include:

- **Low cost**: Using easily accessible unimodal datasets reduces the evaluation cost.
- **Reduced training data contamination**: Lowering the chance of overlap between evaluation data and training data makes the results more trustworthy.
- **Harder-to-saturate benchmark**: The use of complex initial samples makes the benchmark more difficult to saturate.
- **Continuous evaluation metric**: The continuous **GC@T** metric provides more fine-grained results than discrete metrics.
The paper also introduces the **MMECeption** benchmark, built with the GenCeption method, for evaluating Vision LLMs (VLLMs), and compares VLLM performance against that of human annotators. The experimental results support the effectiveness of GenCeption and show that it correlates strongly with established VLLM benchmarks.
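The describe-then-generate loop mentioned in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `describe`, `generate`, and `embed` are hypothetical callables standing in for the MLLM under test, an image generator, and an image encoder. The page does not state the GC@T formula, so `drift_score` uses an assumed scheme (an iteration-weighted average of cosine similarity to the starting sample); consult the paper for the exact definition.

```python
import math
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")  # a non-textual sample, e.g. an image


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_chain(
    seed: Sample,
    describe: Callable[[Sample], str],          # hypothetical: MLLM writes a description
    generate: Callable[[str], Sample],          # hypothetical: generator recreates a sample
    embed: Callable[[Sample], Sequence[float]], # hypothetical: encoder for comparison
    steps: int,
) -> List[float]:
    """Run `steps` describe/generate rounds; return similarity to the seed per round."""
    seed_vec = embed(seed)
    current = seed
    similarities: List[float] = []
    for _ in range(steps):
        text = describe(current)    # sample -> text
        current = generate(text)    # text -> new sample
        similarities.append(cosine_similarity(seed_vec, embed(current)))
    return similarities


def drift_score(similarities: Sequence[float]) -> float:
    """Assumed aggregate: average similarity weighted by iteration index (1..T).

    Later rounds count more, so a chain that stays close to the seed for
    longer scores higher. This weighting is an illustrative assumption.
    """
    if not similarities:
        raise ValueError("need at least one iteration")
    weights = range(1, len(similarities) + 1)
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
```

A higher score means the chain stayed semantically closer to the starting sample, which under this reading corresponds to less hallucination by the model doing the describing.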