Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang
2024-02-18
Abstract: This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (Code and data are available at
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address several key issues in the evaluation of large language models (LLMs):

1. **Insufficiency of Static Benchmark Datasets**:
   - With the rapid development of LLMs and the emergence of new capabilities, existing static benchmark datasets increasingly fail to evaluate these models' abilities and limitations comprehensively.
   - Static datasets are prone to data contamination, where training data may include test data, leading to distorted evaluation results.
2. **Need for Dynamic Evaluation**:
   - There is a need for an evaluation framework that can evolve with the development of LLMs to more accurately reflect the models' performance on different tasks.
   - Dynamic evaluation can help reveal the models' generalization ability and robustness when faced with diverse and complex queries.
3. **Limitations of Existing Methods**:
   - Some existing evaluation methods rely too heavily on a single metric (such as perplexity) and so fail to reflect the model's performance comprehensively.
   - Others lack generality when generating new test samples, making them difficult to apply to all types of tasks.

### Solution

To address the above issues, the paper proposes a multi-agent framework called "Benchmark Self-Evolving Framework." This framework achieves dynamic evaluation through the following methods:

1. **Multi-Agent System**:
   - Utilizes a multi-agent system to modify the context or questions of existing benchmark instances, generating new evolved instances.
   - Ensures the accuracy of generated instances through four key components (instance pre-filter, instance creator, instance validator, and candidate option generator).
2. **Scalable Evaluation**:
   - By creating alternative or more complex questions, it examines the LLMs' generalization ability when faced with diverse and challenging queries.
3. **Robust Evaluation**:
   - Introduces various perturbation strategies in the context (such as synonym replacement, noise addition, polarity reversal) to test the LLMs' sensitivity and adaptability to data noise.
4. **Fine-Grained Evaluation**:
   - Generates sub-capability questions to examine in depth the LLMs' specific problem-solving abilities, including task planning, implicit knowledge recognition, and relevant context retrieval.

### Experimental Results

- **Performance Decline**: Most LLMs score lower under dynamic evaluation than in their original evaluation results, which more accurately reflects the models' true capabilities.
- **Performance Gap Widening**: Dynamic evaluation not only widens the performance gap between different models but also reveals performance differences of the same model on different tasks, helping to select the most suitable model for specific applications.
- **Sub-Capability Analysis**: Fine-grained evaluation reveals deficiencies in certain sub-capabilities of the models, especially in task planning ability, providing directions for future improvements.

In summary, the paper proposes a dynamic evaluation framework to address the insufficiencies of existing static benchmark datasets, providing a new method for more accurately evaluating the capabilities and limitations of LLMs.
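
As a rough illustration of how the four components named under Solution (instance pre-filter, instance creator, instance validator, candidate option generator) might be chained, here is a minimal Python sketch. It is not the authors' implementation. Every name in it (`Instance`, `EvolvedInstance`, `evolve`, and the `toy_*` functions) is hypothetical, and the agents are plain deterministic functions standing in for what would presumably be LLM calls. The sketch also assumes the option generator produces incorrect answer options (distractors), which the summary above does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    """One benchmark item: a context passage, a question, and its answer."""
    context: str
    question: str
    answer: str


@dataclass
class EvolvedInstance:
    """A reframed instance plus the candidate options generated for it."""
    instance: Instance
    distractors: List[str]


def evolve(
    original: Instance,
    pre_filter: Callable[[Instance], bool],
    creator: Callable[[Instance], Instance],
    validator: Callable[[Instance], bool],
    option_generator: Callable[[Instance], List[str]],
) -> Optional[EvolvedInstance]:
    """Run one original instance through the four stages in order.

    Returns None when the instance is filtered out or fails validation.
    """
    if not pre_filter(original):      # stage 1: drop unsuitable originals
        return None
    candidate = creator(original)     # stage 2: reframe context or question
    if not validator(candidate):      # stage 3: reject unreliable candidates
        return None
    # stage 4: attach candidate options (assumed here to be distractors)
    return EvolvedInstance(candidate, option_generator(candidate))


# Toy stand-ins for the agents (hypothetical; real agents would call an LLM).
def toy_pre_filter(inst: Instance) -> bool:
    return bool(inst.context and inst.answer)


def toy_creator(inst: Instance) -> Instance:
    # A "noise addition" style perturbation: append an irrelevant sentence.
    noisy = inst.context + " This sentence is irrelevant filler."
    return Instance(noisy, inst.question, inst.answer)


def toy_validator(inst: Instance) -> bool:
    # Accept only if the answer is still supported by the perturbed context.
    return inst.answer in inst.context


def toy_option_generator(inst: Instance) -> List[str]:
    return ["Lyon"]


original = Instance(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris",
)
result = evolve(
    original, toy_pre_filter, toy_creator, toy_validator, toy_option_generator
)
```

Passing the agents in as callables keeps the control flow separate from whatever model backs each role, so one stage can be swapped or mocked without touching the others.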