Abstract: Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), which span 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with ground-truth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.

What problem does this paper attempt to address?

The main goal of this paper is to propose a new benchmark named SEED-Bench, designed to evaluate the generative comprehension capabilities of Multimodal Large Language Models (MLLMs). Specifically, SEED-Bench aims to address the following key issues:

1. **Evaluation Challenges**: Current evaluation methods for MLLMs have limitations, such as relying on a small number of qualitative examples or on traditional benchmarks that are not suitable for MLLMs with open-form outputs.
2. **Objectivity and Efficiency**: Some existing evaluation methods require human or GPT intervention for quality assessment, which both reduces efficiency and introduces subjective bias.
3. **Comprehensiveness**: Existing evaluation benchmarks are relatively small in scale, which may lead to unstable evaluation results and insufficient coverage of dimensions.

To address these issues, SEED-Bench makes the following contributions:

- **Large-Scale Dataset**: SEED-Bench contains 19K multiple-choice questions with correct answers annotated by humans, making it about six times larger than existing benchmarks.
- **Wide Range of Evaluation Dimensions**: The benchmark covers 12 evaluation dimensions, including spatial and temporal understanding across the image and video modalities.
- **High-Quality Question Generation Process**: Various foundation models extract visual information, and an automatic filtering mechanism combined with a manual verification process ensures the quality of the questions and the accuracy of the answers.
- **Objective Evaluation**: The use of multiple-choice questions avoids reliance on humans or GPT, making model performance evaluation more objective and efficient.
- **Model Comparison**: By comprehensively evaluating 18 models, including Large Language Models (LLMs), Image Multimodal Models (ImageLLMs), and Video Multimodal Models (VideoLLMs), SEED-Bench provides a detailed comparative analysis of the generative comprehension capabilities of existing models.

With the introduction of SEED-Bench, researchers can gain a more comprehensive understanding of the strengths and limitations of different models, thereby promoting the research and development of future Multimodal Large Language Models.
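To illustrate why multiple-choice questions with ground-truth options allow scoring without human or GPT judges, here is a minimal sketch of per-dimension accuracy. It is not the paper's evaluation code: the record fields (`id`, `dimension`, `answer`), the function name `accuracy_by_dimension`, and the dimension labels `spatial` and `temporal` are hypothetical placeholders, and how a model's chosen option is obtained is left outside the sketch.

```python
from collections import defaultdict


def accuracy_by_dimension(questions, predictions):
    """Return {dimension: accuracy} for multiple-choice questions.

    questions:   list of dicts with hypothetical keys 'id', 'dimension',
                 and 'answer' (the ground-truth option letter).
    predictions: dict mapping question id -> option letter chosen by a model.
    A question with no prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        dim = q["dimension"]
        total[dim] += 1
        # Scoring is an exact match against the ground-truth option,
        # so no human or LLM judge is involved.
        if predictions.get(q["id"]) == q["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data; dimension labels are placeholders, not the benchmark's 12 dimensions.
questions = [
    {"id": 1, "dimension": "spatial", "answer": "A"},
    {"id": 2, "dimension": "spatial", "answer": "C"},
    {"id": 3, "dimension": "temporal", "answer": "B"},
]
predictions = {1: "A", 2: "B", 3: "B"}

scores = accuracy_by_dimension(questions, predictions)
print(scores)
```

Because each question has exactly one annotated correct option, the score is a deterministic function of the model's choices, which is what makes results reproducible across runs and comparable across models.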

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

A Survey on Benchmarks of Multimodal Large Language Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MMBench: Is Your Multi-modal Model an All-around Player?

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data

MileBench: Benchmarking MLLMs in Long Context

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Benchmarking LLMs' Judgments with No Gold Standard