Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

Wanying Wang, Zeyu Ma, Pengfei Liu, Mingang Chen
2024-10-16
Abstract: While various vertical domain large language models (LLMs) have been developed, the challenge of automatically evaluating their performance across different domains remains significant. Current benchmark-based evaluation methods exhibit rigid, aimless interactions and rely on pre-collected static datasets that are costly to build, inflexible across domains, and misaligned with practical user needs. To address this issue, we revisit the evaluation components and introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process, enabling deeper exploration and supporting both quantitative metrics and qualitative insights. These concepts capture the nuanced behaviors of LLMs through richer, multi-turn interactions. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval-augmented generation and reinforcement learning. Experiments on tasks ranging from constructing vertical domain evaluation to activating existing benchmarks demonstrate the effectiveness of TestAgent across various scenarios. We believe this work offers an interesting perspective on automatic evaluation for LLMs.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the challenge of automatically evaluating the performance of large language models (LLMs) across different vertical domains. It identifies three main problems with current benchmark-based evaluation methods:

1. **Static Datasets**: Existing evaluation methods rely on pre-collected static datasets, which are costly to construct, transfer poorly across domains, and are misaligned with actual user needs.
2. **Lack of Dynamic Interaction**: Traditional evaluation methods usually adopt a fixed question-and-answer format, which is far from real-world multi-turn dialogue and cannot fully explore a model's capabilities.
3. **Single Evaluation Metric**: Existing evaluation metrics are designed mainly to provide numerical comparisons and do not probe a model's underlying problems in depth. In vertical domains, where comparable models are often absent, evaluation becomes even more difficult.

To solve these problems, the paper proposes two new concepts and a framework:

- **Benchmark+**: Extends the traditional question-and-answer benchmark into a more flexible "strategy-criterion" format, making the benchmark more comprehensive and detailed.
- **Assessment+**: Deepens evaluation through a dynamic interaction process, supports both quantitative and qualitative analysis, and better captures the subtle behaviors of the model.
- **TestAgent**: An agent-based evaluation framework that implements the above concepts through retrieval-augmented generation (RAG) and reinforcement learning (RL), enabling in-depth dynamic evaluation and activating existing static benchmarks.

The paper verifies the effectiveness of TestAgent in different scenarios through experiments, including building vertical-domain evaluation benchmarks from scratch and activating existing general-domain benchmarks. The experimental results show that TestAgent can effectively evaluate the performance of LLMs in multiple fields such as government affairs, healthcare, and reading comprehension.
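To make the contrast between a static question-answer item and a "strategy-criterion" item concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the class names, field names, and the `run_dialogue` helper are all hypothetical, and the questioner, target model, and judge are passed in as plain callables rather than the RAG and RL components the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QAItem:
    """Traditional static benchmark entry: one question, one reference answer."""
    question: str
    answer: str


@dataclass
class StrategyCriterionItem:
    """Hypothetical Benchmark+ style entry (field names are assumptions)."""
    topic: str
    strategy: str   # how the questioner should probe the target model
    criterion: str  # what a judge should look for in each reply


Turn = Tuple[str, str]  # (question, reply)


def run_dialogue(
    item: StrategyCriterionItem,
    ask: Callable[[StrategyCriterionItem, List[Turn]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], float],
    max_turns: int = 3,
) -> Tuple[List[Turn], float]:
    """Run a multi-turn probe; return the transcript and the mean score."""
    history: List[Turn] = []
    scores: List[float] = []
    for _ in range(max_turns):
        question = ask(item, history)  # next question depends on earlier turns
        reply = target(question)       # model under evaluation
        scores.append(judge(item.criterion, reply))
        history.append((question, reply))
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return history, mean_score
```

The transcript serves the qualitative side of the evaluation and the mean score the quantitative side. In a real system, `ask` would be the component that generates follow-up questions from the strategy and the dialogue so far, and `judge` would apply the criterion to each reply; both are left as stubs here because the summary above does not specify them.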