Abstract: Large language models (LLMs) like ChatGPT have demonstrated remarkable intelligence. How to evaluate the question-solving abilities of LLMs and their degrees of intelligence is a widely studied but challenging problem. First, question-solving abilities are interlaced with different ability branches, such as understanding, and with massive knowledge categories, such as mathematics. Second, question inputs are multimodal and may involve both text and images. Third, the response format of LLMs is diverse and thus poses great challenges for result extraction and evaluation. In this paper, we propose AGIBench, a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs. Instead of a collection of blended questions, AGIBench focuses on three typical ability branches and adopts a four-tuple <ability branch, knowledge, difficulty, modal> to label the attributes of each question. First, it supports multi-granularity benchmarking, e.g., at the per-question, per-ability branch, per-knowledge, per-modal, per-dataset, and per-difficulty level granularities. Second, it contains multimodal input, including text and images. Third, it classifies all the questions into five degrees of difficulty according to the average accuracy rate of a large number of educated humans (human-referenced). Fourth, it adopts zero-shot learning to avoid introducing additional unpredictability and provides an auto-scoring method to extract and judge the result. Finally, it defines multi-dimensional metrics, including accuracy under the average, worst, best, and majority voting cases, and repeatability. AGIBench is publicly available from https://www.benchcouncil.org/agibench.
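
As a rough illustration of the four-tuple labeling and the multi-dimensional metrics described above, the following Python sketch scores repeated zero-shot runs of one question and then aggregates scores at a chosen granularity. The abstract does not give formal definitions, so everything here is an assumption made for illustration: the names `Question`, `score_question`, and `aggregate` are hypothetical, "worst case" is taken to mean correct in every run, "best case" correct in at least one run, and "repeatability" that all runs return the same extracted answer. AGIBench's actual definitions may differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One question labeled with the four-tuple from the abstract (hypothetical schema)."""
    qid: str
    ability_branch: str  # e.g. "understanding"
    knowledge: str       # e.g. "mathematics"
    difficulty: int      # assumed to be 1..5 (five degrees of difficulty)
    modal: str           # "text" or "image"
    answer: str          # reference answer


def score_question(question, responses):
    """Score the answers extracted from repeated runs of a single question.

    All five definitions below are assumptions, not taken from the paper.
    """
    correct = [r == question.answer for r in responses]
    majority_answer, _ = Counter(responses).most_common(1)[0]
    return {
        "average": sum(correct) / len(correct),      # mean accuracy over runs
        "worst": float(all(correct)),                # correct in every run
        "best": float(any(correct)),                 # correct in at least one run
        "majority": float(majority_answer == question.answer),
        "repeatability": float(len(set(responses)) == 1),
    }


def aggregate(questions, runs, key):
    """Average per-question scores grouped by one label, e.g. key="modal"."""
    groups = defaultdict(list)
    for q in questions:
        groups[getattr(q, key)].append(score_question(q, runs[q.qid]))
    return {
        group: {m: sum(s[m] for s in scores) / len(scores) for m in scores[0]}
        for group, scores in groups.items()
    }
```

Calling `aggregate(questions, runs, "difficulty")` or `aggregate(questions, runs, "knowledge")` on the same data would give the per-difficulty and per-knowledge views, which is one simple way to read "multi-granularity" in this setting.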

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Systematic Inequalities in Language Technology Performance across the World's Languages

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

MILU: A Multi-task Indic Language Understanding Benchmark

Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

XferBench: a Data-Driven Benchmark for Emergent Language

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark