Abstract: Current benchmarks for evaluating large language models (LLMs) suffer from restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" on LLMs, it uses specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendations for optimization.

What problem does this paper attempt to address?

The paper attempts to address the limitations of current large language model (LLM) evaluation benchmarks, including restrictions on evaluation content, lack of timely updates, and insufficient optimization guidance. Specifically:

1. **Restrictions on Evaluation Content**: Most current LLM evaluation benchmarks adopt an exam-like format, testing the model's knowledge through a series of fixed questions. However, this static knowledge assessment method is insufficient to comprehensively measure the LLM's performance in real-world applications, especially in terms of its ability to solve tasks in specialized fields.
2. **Lack of Dynamic Updates in Evaluation Datasets**: Information in the real world is constantly changing, yet many existing evaluation datasets are rarely updated after their release. This can result in evaluation outcomes that do not reflect the latest performance of the LLM. For example, in security scenarios, new sensitive events occur daily, necessitating timely updates to the evaluation data to ensure the LLM does not generate unsafe responses.
3. **Evaluation Metrics Insufficient for Guiding Model Optimization**: Existing evaluation methods typically generate only a composite score, lacking in-depth analysis and optimization suggestions for specific issues within the LLM. This makes it difficult for developers to make targeted improvements to the model based on the evaluation results.

To address these issues, the paper proposes a new evaluation paradigm, "Benchmarking-Evaluation-Assessment" (BEA), which aims to more comprehensively evaluate the LLM's capabilities through specific task-solving processes and provide detailed optimization suggestions. This new paradigm likens the evaluation process to a medical check-up in a hospital, using progressively deeper evaluation methods to identify and resolve specific issues within the LLM.
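The limitation around composite scores can be made concrete with a small sketch. The Python below is not the paper's method. It only illustrates, with made-up records, how a single aggregate number can hide the task categories where a model is weakest. The category names, the outcomes, the function names, and the 0.5 threshold are all assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (task category, whether the answer passed).
# Categories and outcomes are invented for illustration only.
results = [
    ("knowledge_qa", True),
    ("knowledge_qa", True),
    ("knowledge_qa", True),
    ("domain_task", False),
    ("domain_task", True),
    ("safety", False),
    ("safety", False),
    ("safety", True),
]


def composite_score(records):
    """Single aggregate accuracy, as a score-only report would give."""
    return sum(passed for _, passed in records) / len(records)


def per_category_scores(records):
    """Accuracy broken down by task category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in records:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def weakest_categories(records, threshold=0.5):
    """Categories whose accuracy is strictly below the assumed threshold, worst first."""
    scores = per_category_scores(records)
    weak = [(category, score) for category, score in scores.items() if score < threshold]
    return sorted(weak, key=lambda item: item[1])


if __name__ == "__main__":
    print("composite:", composite_score(results))
    print("by category:", per_category_scores(results))
    print("needs attention:", weakest_categories(results))
```

In this toy data the composite accuracy is 0.625, which on its own says nothing about where the failures are. The per-category view shows that every knowledge question passed while the safety category passed only one of three. That breakdown is the kind of attribution signal a score-only report leaves out.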

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks

Don't Make Your LLM an Evaluation Benchmark Cheater

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

Benchmarking Foundation Models with Language-Model-as-an-Examiner

A Survey on Benchmarks of Multimodal Large Language Models

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

A Survey on Evaluation of Large Language Models

Evaluating Large Language Models: A Comprehensive Survey

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

A User-Centric Benchmark for Evaluating Large Language Models.

Large Language Models in Healthcare: A Comprehensive Benchmark