Abstract: Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded on input procedures, and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and applicable to AI agents in other domains.

What problem does this paper attempt to address?

The paper attempts to address the issue of evaluating tool-augmented large language models (LLMs) as conversational AI agents in real-world applications. Specifically, the paper focuses on the following points:

1. **Diversity and Complexity**: Existing evaluation methods mainly focus on single interactions and function calls, failing to comprehensively assess the performance of LLMs in handling complete conversations.
2. **Real-world Scenarios**: Existing datasets typically cover only simple tasks and do not reflect the complex conversational scenarios in the real world.
3. **Tool Usage**: Evaluating whether LLMs can correctly invoke tools (such as APIs) and follow predefined processes.
4. **Robustness**: Assessing the robustness of LLMs when faced with potentially malicious behavior or unexpected operations from users.

To address these issues, the authors propose an automatic test data generation method, which includes the following steps:

- **Intent Generation**: Using an LLM to generate the initial intent of the conversation.
- **Process Generation**: Generating specific processing flows based on the intent.
- **API Extraction**: Extracting relevant APIs from the process.
- **Flowchart Generation**: Converting the process into a flowchart to ensure the generated test data conforms to predefined processes.
- **Dialogue Graph Generation**: Converting the flowchart into a dialogue graph to make it closer to real conversation structures.
- **Noise Generation**: Introducing noise into the dialogue graph to simulate unexpected behavior or malicious operations from users.
- **Path Sampling**: Sampling paths from the dialogue graph to generate diverse conversational scenarios.
- **Dialogue Generation**: Generating specific dialogues based on the sampled paths.
- **Test Extraction**: Extracting test cases from the generated dialogues to evaluate the performance of AI agents.

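The path sampling and test extraction steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the graph contents, the node names, the turn format (`role`/`text` dictionaries), and the functions `enumerate_paths`, `sample_paths`, and `extract_tests` are all hypothetical, and the graph is assumed to be acyclic with a single `start` node.

```python
import random

# Hypothetical dialogue graph: each node maps to its successor nodes.
# Node names are illustrative only and do not come from the paper.
DIALOGUE_GRAPH = {
    "start": ["ask_order_id"],
    "ask_order_id": ["lookup_order"],
    "lookup_order": ["order_found", "order_not_found"],
    "order_found": ["issue_refund"],
    "order_not_found": ["end"],
    "issue_refund": ["end"],
    "end": [],
}


def enumerate_paths(graph, node="start", prefix=None):
    """Yield every path from `node` to a terminal node (assumes no cycles)."""
    prefix = (prefix or []) + [node]
    successors = graph[node]
    if not successors:
        yield prefix
        return
    for nxt in successors:
        yield from enumerate_paths(graph, nxt, prefix)


def sample_paths(graph, k, seed=0):
    """Sample up to `k` distinct paths, seeded for reproducibility."""
    paths = list(enumerate_paths(graph))
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))


def extract_tests(dialogue):
    """Turn each agent turn into one test: prior turns are the context,
    and the agent turn itself is the expected output."""
    tests = []
    for i, turn in enumerate(dialogue):
        if turn["role"] == "agent":
            tests.append({"context": dialogue[:i], "expected": turn})
    return tests


if __name__ == "__main__":
    for path in sample_paths(DIALOGUE_GRAPH, k=2):
        print(" -> ".join(path))
```

Enumerating all paths before sampling is only practical for small graphs; a larger graph would call for random walks or a coverage-weighted sampler instead.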
Additionally, the authors propose a manually annotated dataset, ALMITA, specifically designed to evaluate AI agents in the customer support domain. Through this dataset, the authors benchmarked various LLMs and found that while these models perform well in single interactions and API calls, they still have shortcomings in handling complete conversations.
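
The gap between single-interaction and whole-conversation performance can be made concrete with a small sketch. It assumes, as an illustration rather than a statement of the paper's exact metric, that a conversation counts as correct only when every test extracted from it passes; the record fields `conversation_id` and `passed` and both function names are hypothetical.

```python
from collections import defaultdict


def per_test_accuracy(results):
    """Fraction of individual tests that passed."""
    return sum(r["passed"] for r in results) / len(results)


def conversation_accuracy(results):
    """Fraction of conversations in which every test passed (assumed criterion)."""
    by_conversation = defaultdict(list)
    for r in results:
        by_conversation[r["conversation_id"]].append(r["passed"])
    fully_correct = sum(all(flags) for flags in by_conversation.values())
    return fully_correct / len(by_conversation)


# Made-up results for illustration: conversation "a" fails on its last test.
results = [
    {"conversation_id": "a", "passed": True},
    {"conversation_id": "a", "passed": True},
    {"conversation_id": "a", "passed": True},
    {"conversation_id": "a", "passed": False},
    {"conversation_id": "b", "passed": True},
    {"conversation_id": "b", "passed": True},
]

if __name__ == "__main__":
    print(per_test_accuracy(results))
    print(conversation_accuracy(results))
```

Under this criterion a single failed turn invalidates the whole conversation, so conversation-level accuracy can be much lower than per-test accuracy even when most individual responses are correct.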

Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
