Abstract: While various vertical-domain large language models (LLMs) have been developed, the challenge of automatically evaluating their performance across different domains remains significant. Current benchmark-based evaluation methods exhibit rigid, aimless interactions and rely on pre-collected static datasets that are costly to build, inflexible across domains, and misaligned with practical user needs. To address this issue, we revisit the evaluation components and introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process, enabling deeper exploration and supporting both quantitative metrics and qualitative insights. These concepts capture the nuanced behaviors of LLMs through richer, multi-turn interactions. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval-augmented generation and reinforcement learning. Experiments on tasks ranging from constructing vertical-domain evaluations to activating existing benchmarks demonstrate the effectiveness of TestAgent across various scenarios. We believe this work offers an interesting perspective on automatic evaluation for LLMs.

What problem does this paper attempt to address?

The paper addresses the challenge of automatically evaluating the performance of large language models (LLMs) across different vertical domains. It identifies three main problems with current benchmark-based evaluation methods:

1. **Static Datasets**: Existing evaluation methods rely on pre-collected static datasets, which are costly to construct, transfer poorly across domains, and are misaligned with actual user needs.
2. **Lack of Dynamic Interaction**: Traditional evaluation methods usually adopt a fixed question-and-answer format, which is far removed from real-world multi-turn dialogue and cannot fully explore a model's capabilities.
3. **Single Evaluation Metric**: Existing evaluation metrics are mainly designed to provide numerical comparisons and do not probe a model's potential problems in depth. Evaluation is especially difficult in vertical domains, where comparable models are often unavailable.

To solve these problems, the paper proposes two new concepts and a framework:

- **Benchmark+**: Extends the traditional question-and-answer benchmark into a more flexible "strategy-criterion" format, making the benchmark more comprehensive and detailed.
- **Assessment+**: Deepens the evaluation through a dynamic interaction process, supports both quantitative and qualitative analysis, and better captures the subtle behaviors of the model.
- **TestAgent**: An agent-based evaluation framework that implements the above concepts through retrieval-augmented generation (RAG) and reinforcement learning (RL), enabling in-depth dynamic evaluation and activating existing static benchmarks.

The paper verifies the effectiveness of TestAgent in different scenarios through experiments, including building vertical-domain evaluation benchmarks from scratch and activating existing general-domain benchmarks. The experimental results show that TestAgent can effectively evaluate the performance of LLMs in multiple areas such as government affairs, healthcare, and reading comprehension.
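
To make the "strategy-criterion" idea more concrete, the following is a minimal Python sketch of how such a benchmark entry and a multi-turn assessment loop might look. It is not the paper's implementation: the names `StrategyCriterionItem`, `run_assessment`, and `toy_model` are hypothetical, the strategy is a fixed list of follow-up questions rather than a learned policy, and the criterion is a simple keyword check. The RAG and RL components described above are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs seen so far


@dataclass
class StrategyCriterionItem:
    """Hypothetical Benchmark+-style entry: how to probe and how to judge."""
    topic: str
    strategy: List[str]               # ordered questions to ask (assumed fixed here)
    criterion: Callable[[str], bool]  # returns True if an answer is acceptable


def run_assessment(item: StrategyCriterionItem,
                   model: Callable[[str, History], str],
                   max_turns: int = 3):
    """Ask the strategy's questions in turn and judge each reply.

    Returns a quantitative score (fraction of acceptable answers) and the
    full transcript, which can be inspected for qualitative insight.
    """
    history: History = []
    transcript = []
    passed = 0
    for question in item.strategy[:max_turns]:
        answer = model(question, history)
        ok = bool(item.criterion(answer))
        passed += ok
        transcript.append((question, answer, ok))
        history.append((question, answer))
    score = passed / len(transcript) if transcript else 0.0
    return score, transcript


def toy_model(question: str, history: History) -> str:
    """Stand-in for the LLM under test: answers only the first question."""
    if not history:
        return "Bring your passport."
    return "I do not know."


item = StrategyCriterionItem(
    topic="government affairs",
    strategy=["What document do I need to apply?",
              "What if that document has expired?"],
    criterion=lambda answer: "passport" in answer.lower(),
)

score, transcript = run_assessment(item, toy_model)
print(score)
for question, answer, ok in transcript:
    print(ok, "|", question, "->", answer)
```

Returning the transcript alongside the score mirrors the summary's point that an assessment should support both quantitative metrics and qualitative inspection of the dialogue.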

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

AgentBench: Evaluating LLMs as Agents

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Benchmarking Foundation Models with Language-Model-as-an-Examiner

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Agent-as-a-Judge: Evaluate Agents with Agents

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments

Active Evaluation Acquisition for Efficient LLM Benchmarking

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

360°REA: Towards A Reusable Experience Accumulation with 360 Assessment for Multi-Agent System

AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents

ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents