Abstract: This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (Code and data are available at

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address several key issues in the evaluation of large language models (LLMs):

1. **Insufficiency of Static Benchmark Datasets**:
   - As LLMs develop rapidly and new capabilities emerge, existing static benchmark datasets increasingly fail to evaluate these models' abilities and limitations comprehensively.
   - Static datasets are prone to data contamination, where training data may include test data, leading to distorted evaluation results.
2. **Need for Dynamic Evaluation**:
   - An evaluation framework is needed that can evolve alongside LLMs so that it more accurately reflects the models' performance on different tasks.
   - Dynamic evaluation can help reveal the models' generalization ability and robustness when faced with diverse and complex queries.
3. **Limitations of Existing Methods**:
   - Some existing evaluation methods rely too heavily on a single metric (such as perplexity) and so fail to reflect a model's performance comprehensively.
   - Others generalize poorly when generating new test samples, making them difficult to apply to all types of tasks.

### Solution

To address the above issues, the paper proposes a multi-agent framework called "Benchmark Self-Evolving Framework." This framework achieves dynamic evaluation through the following methods:

1. **Multi-Agent System**:
   - Utilizes a multi-agent system to modify the context or questions of existing benchmark instances, generating new evolved instances.
   - Ensures the accuracy of generated instances through four key components (instance pre-filter, instance creator, instance validator, and candidate option generator).
2. **Scalable Evaluation**:
   - Creates alternative or more complex questions to examine the LLMs' generalization ability when faced with diverse and challenging queries.
3. **Robust Evaluation**:
   - Introduces various perturbation strategies in the context (such as synonym replacement, noise addition, polarity reversal) to test the LLMs' sensitivity and adaptability to data noise.
4. **Fine-Grained Evaluation**:
   - Generates sub-capability questions to examine the LLMs' specific problem-solving abilities in depth, including task planning, implicit knowledge recognition, and relevant context retrieval.

### Experimental Results

- **Performance Decline**: Most LLMs score lower under dynamic evaluation than in their original evaluation results, which more accurately reflects the models' true capabilities.
- **Performance Gap Widening**: Dynamic evaluation not only widens the performance gap between different models but also reveals performance differences of the same model on different tasks, helping to select the most suitable model for specific applications.
- **Sub-Capability Analysis**: Fine-grained evaluation reveals deficiencies in certain sub-capabilities of the models, especially in task planning ability, providing directions for future improvements.

In summary, the paper proposes a dynamic evaluation framework to address the insufficiencies of existing static benchmark datasets, providing a new method for more accurately evaluating the capabilities and limitations of LLMs.
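The four components named under the Multi-Agent System item (instance pre-filter, instance creator, instance validator, candidate option generator) could be chained roughly as in the sketch below. This is an illustration only, not the paper's implementation: the `Instance` type, the `evolve` function, and all `toy_*` stand-ins are hypothetical names, and each stage is assumed to be a plain callable, whereas in practice each would presumably be backed by an LLM agent.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Instance:
    """Hypothetical container for one benchmark item."""
    context: str
    question: str
    answer: str
    options: Tuple[str, ...] = ()


def evolve(
    instances: Iterable[Instance],
    pre_filter: Callable[[Instance], bool],
    creator: Callable[[Instance], Instance],
    validator: Callable[[Instance], bool],
    option_generator: Callable[[Instance], List[str]],
) -> List[Instance]:
    """Run each original instance through the four stages in order."""
    evolved = []
    for original in instances:
        if not pre_filter(original):      # stage 1: skip unsuitable originals
            continue
        candidate = creator(original)     # stage 2: reframe context or question
        if not validator(candidate):      # stage 3: discard unverified candidates
            continue
        distractors = option_generator(candidate)  # stage 4: candidate options
        options = tuple(sorted({candidate.answer, *distractors}))
        evolved.append(replace(candidate, options=options))
    return evolved


# Toy stand-ins (assumptions for illustration; real stages would call agents).
def toy_pre_filter(inst: Instance) -> bool:
    # Keep only instances with non-empty context, question, and answer.
    return all(s.strip() for s in (inst.context, inst.question, inst.answer))


def toy_creator(inst: Instance) -> Instance:
    # Rephrase the question while leaving context and answer unchanged.
    q = inst.question
    return replace(inst, question=f"Based only on the context, {q[0].lower()}{q[1:]}")


def toy_validator(inst: Instance) -> bool:
    # Accept the candidate only if its answer is supported by the context.
    return inst.answer.lower() in inst.context.lower()


def toy_option_generator(inst: Instance) -> List[str]:
    # Use other capitalized words from the context as distractor options.
    words = [w.strip(".,") for w in inst.context.split() if w[0].isupper()]
    return [w for w in words if w != inst.answer][:3]


if __name__ == "__main__":
    data = [
        Instance("Paris is the capital of France. Lyon is a large French city.",
                 "What is the capital of France?", "Paris"),
        Instance("", "What is the capital of Spain?", "Madrid"),
    ]
    for item in evolve(data, toy_pre_filter, toy_creator,
                       toy_validator, toy_option_generator):
        print(item.question, item.options)
```

Passing the stages in as callables keeps the pipeline order explicit and lets each stage be swapped for a model-backed agent without changing `evolve`.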

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
