Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Jin Liu, Qingquan Li, Wenlong Du
2024-07-10
Abstract: In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" on LLMs, it uses specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendations for optimization.
Computation and Language
What problem does this paper attempt to address?
The paper attempts to address the limitations of current large language model (LLM) evaluation benchmarks, including restrictions on evaluation content, lack of timely updates, and insufficient optimization guidance. Specifically:

1. **Restrictions on Evaluation Content**: Most current LLM evaluation benchmarks adopt an exam-like format, testing the model's knowledge through a series of fixed questions. However, this static knowledge assessment method is insufficient to comprehensively measure the LLM's performance in real-world applications, especially its ability to solve tasks in specialized fields.
2. **Lack of Dynamic Updates in Evaluation Datasets**: Information in the real world is constantly changing, yet many existing evaluation datasets are rarely updated after their release. This can result in evaluation outcomes that do not reflect the latest performance of the LLM. For example, in security scenarios, new sensitive events occur daily, so the evaluation data must be updated promptly to ensure the LLM does not generate unsafe responses.
3. **Evaluation Metrics Insufficient for Guiding Model Optimization**: Existing evaluation methods typically generate only a composite score, lacking in-depth analysis and optimization suggestions for specific issues within the LLM. This makes it difficult for developers to make targeted improvements to the model based on the evaluation results.

To address these issues, the paper proposes a new evaluation paradigm, "Benchmarking-Evaluation-Assessment" (BEA), which aims to evaluate the LLM's capabilities more comprehensively through specific task-solving processes and to provide detailed optimization suggestions. This new paradigm likens the evaluation process to a medical check-up in a hospital, using progressively deeper evaluation methods to identify and resolve specific issues within the LLM.
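
The gap between a single composite score and a more diagnostic breakdown can be illustrated with a minimal sketch. This is not the paper's method; the task categories, the `results` data, and the function names are all hypothetical and chosen only to show why one aggregate number can hide a weak area that a per-task view exposes.

```python
from collections import defaultdict

# Hypothetical per-item results: (task category, whether the model answered correctly).
# Both the categories and the outcomes are invented for illustration.
results = [
    ("contract_review", True),
    ("contract_review", True),
    ("contract_review", True),
    ("incident_triage", False),
    ("incident_triage", False),
    ("incident_triage", True),
]


def composite_score(items):
    """Return one accuracy number over all items, as a typical benchmark would."""
    return sum(ok for _, ok in items) / len(items)


def per_category_scores(items):
    """Return accuracy per task category, a first step toward attributing failures."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, ok in items:
        totals[category][0] += int(ok)
        totals[category][1] += 1
    return {c: correct / n for c, (correct, n) in totals.items()}


overall = composite_score(results)
breakdown = per_category_scores(results)
weakest = min(breakdown, key=breakdown.get)  # category with the lowest accuracy

print(f"composite: {overall:.2f}")
for category, score in breakdown.items():
    print(f"{category}: {score:.2f}")
print(f"weakest category: {weakest}")
```

In this toy data the composite accuracy alone does not reveal that every failure falls in one category. A per-category view is only the simplest form of attribution; the deeper diagnosis and optimization advice the paper describes would go well beyond it.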