SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
2023-08-02
Abstract: Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), which span 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to propose a new benchmark named SEED-Bench, designed to evaluate the generative comprehension capabilities of Multimodal Large Language Models (MLLMs). Specifically, SEED-Bench aims to address the following key issues:

1. **Evaluation Challenges**: Current evaluation methods for MLLMs have limitations, such as relying on a limited number of qualitative examples or on traditional benchmarks that are not suitable for MLLMs with open-form outputs.
2. **Objectivity and Efficiency**: Some existing evaluation methods require human or GPT intervention for quality assessment, which not only reduces efficiency but also introduces subjective bias.
3. **Comprehensiveness**: Existing evaluation benchmarks are relatively small in scale, which may lead to unstable evaluation results and insufficient coverage of dimensions.

To address these issues, SEED-Bench makes the following contributions:

- **Large-Scale Dataset**: SEED-Bench contains about 19,000 multiple-choice questions, with correct answers annotated by humans, making it about 6 times larger than existing benchmarks.
- **Wide Range of Evaluation Dimensions**: The benchmark covers 12 evaluation dimensions, including spatial and temporal understanding in the image and video modalities.
- **High-Quality Question Generation Process**: Various foundation models are used to extract visual information, and the generated questions then pass through an automatic filtering mechanism and a manual verification process to ensure the quality of the questions and the accuracy of the answers.
- **Objective Evaluation**: Because each multiple-choice question has a human-annotated groundtruth option, evaluation does not rely on humans or GPT, making model performance assessment more objective and efficient.
- **Model Comparison**: By comprehensively evaluating 18 models of three types, namely Large Language Models (LLMs), Image Multimodal Models (ImageLLMs), and Video Multimodal Models (VideoLLMs), SEED-Bench provides a detailed comparative analysis of the generative comprehension capabilities of existing models.

With the introduction of SEED-Bench, researchers can gain a more comprehensive understanding of the strengths and limitations of different models, thereby promoting the research and development of future Multimodal Large Language Models.
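
To make the idea of objective multiple-choice evaluation concrete, here is a minimal Python sketch of how per-dimension accuracy could be computed once a model has picked an option for every question. It is not the SEED-Bench evaluation code: the record layout (`dimension`, `answer`, `prediction`), the function names, and the dimension labels in the sample data are all assumptions made for illustration, and the sketch does not cover how a model's option is selected.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for a list of multiple-choice results.

    Each record is a dict with hypothetical keys:
      'dimension'  - label of the evaluation dimension
      'answer'     - groundtruth option, e.g. 'A'
      'prediction' - option chosen by the model, e.g. 'B'
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        # Scoring is an exact match against the annotated option,
        # so no human or GPT judge is involved.
        if record["prediction"] == record["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


def overall_accuracy(records):
    """Return the fraction of all questions answered correctly."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["prediction"] == r["answer"])
    return hits / len(records)


# Hypothetical sample results; the dimension labels are placeholders,
# not the benchmark's actual dimension names.
sample_records = [
    {"dimension": "spatial_example", "answer": "A", "prediction": "A"},
    {"dimension": "spatial_example", "answer": "C", "prediction": "B"},
    {"dimension": "temporal_example", "answer": "D", "prediction": "D"},
]

per_dimension = accuracy_by_dimension(sample_records)
overall = overall_accuracy(sample_records)
```

Reporting accuracy per dimension as well as overall mirrors the summary's emphasis on comparing models across all 12 dimensions, since a single aggregate number can hide weaknesses in, for example, temporal understanding.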