Automated test generation to evaluate tool-augmented LLMs as conversational AI agents

Samuel Arcadinho, David Aparicio, Mariana Almeida
2024-10-10
Abstract: Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded on input procedures, and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and applicable to AI agents in different domains.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of evaluating tool-augmented large language models (LLMs) as conversational AI agents in real-world applications. It focuses on four points:

1. **Diversity and Complexity**: Existing evaluation methods mainly cover single interactions and function calls, so they do not assess how well LLMs handle complete conversations.
2. **Real-world Scenarios**: Existing datasets typically cover only simple tasks and do not reflect the complex conversational scenarios found in practice.
3. **Tool Usage**: Evaluating whether LLMs correctly invoke tools (such as APIs) and follow predefined procedures.
4. **Robustness**: Assessing how LLMs behave when users act maliciously or unexpectedly.

To address these issues, the authors propose an automated test generation pipeline with the following steps:

- **Intent Generation**: Using an LLM to generate the initial intent of the conversation.
- **Procedure Generation**: Generating the procedure the agent should follow for that intent.
- **API Extraction**: Extracting the relevant APIs from the procedure.
- **Flowchart Generation**: Converting the procedure into a flowchart so that the generated tests conform to it.
- **Dialogue Graph Generation**: Converting the flowchart into a dialogue graph that is closer to the structure of a real conversation.
- **Noise Generation**: Adding noise to the dialogue graph to simulate unexpected or malicious user behavior.
- **Path Sampling**: Sampling paths from the dialogue graph to obtain diverse conversational scenarios.
- **Dialogue Generation**: Generating a concrete dialogue for each sampled path.
- **Test Extraction**: Extracting test cases from the generated dialogues to evaluate AI agents.
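The path sampling and test extraction steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the graph contents, the `user:` / `agent:` / `api:` node prefixes, and the function names `enumerate_paths`, `sample_paths`, and `extract_tests` are all hypothetical, and the sketch assumes an acyclic dialogue graph in which each test is one agent or API step together with the conversation that precedes it.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical dialogue graph: each node maps to its possible next nodes.
# The "user:", "agent:" and "api:" prefixes are an assumption of this sketch.
GRAPH: Dict[str, List[str]] = {
    "user:request_refund": ["agent:ask_order_id"],
    "agent:ask_order_id": ["user:gives_order_id", "user:off_topic"],
    "user:off_topic": ["agent:redirect"],  # a "noise" branch
    "agent:redirect": [],
    "user:gives_order_id": ["api:get_order"],
    "api:get_order": ["agent:confirm_refund"],
    "agent:confirm_refund": [],
}


def enumerate_paths(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every path from `start` to a leaf (graph assumed acyclic)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]]
        if not successors:
            paths.append(path)  # reached a leaf: the conversation ends here
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


def sample_paths(graph: Dict[str, List[str]], start: str, k: int,
                 seed: int = 0) -> List[List[str]]:
    """Sample up to k distinct conversation paths without replacement."""
    rng = random.Random(seed)
    paths = enumerate_paths(graph, start)
    return rng.sample(paths, min(k, len(paths)))


def extract_tests(path: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one path into (context, expected_step) pairs.

    Each agent reply or API call becomes one test whose context is
    everything that happened before it in the conversation.
    """
    tests = []
    for i, node in enumerate(path):
        if node.startswith(("agent:", "api:")):
            tests.append((path[:i], node))
    return tests


if __name__ == "__main__":
    for path in sample_paths(GRAPH, "user:request_refund", k=2):
        for context, expected in extract_tests(path):
            print(len(context), "prior turns ->", expected)
```

Enumerating all leaf paths before sampling keeps the sketch simple and makes coverage easy to reason about; for large graphs, a random walk or a coverage-weighted sampler would be a more practical choice.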
Additionally, the authors introduce ALMITA, a manually curated dataset designed to evaluate AI agents in the customer support domain. Benchmarking several LLMs on this dataset, they find that while the models perform well on single interactions and API calls, they still fall short when handling complete conversations.
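The gap between single-interaction and whole-conversation performance can be made concrete with a small sketch. It assumes, as one natural scoring rule rather than a confirmed detail of the paper, that a conversation counts as correct only if every test extracted from it passes. The pass/fail values in `RESULTS` are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
from typing import Dict, List

# Made-up pass/fail outcomes per conversation (NOT results from the paper).
RESULTS: Dict[str, List[bool]] = {
    "conv_a": [True, True, True],
    "conv_b": [True, False, True],
    "conv_c": [True, True, False, True],
}


def per_test_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual tests passed, pooled over all conversations."""
    outcomes = [ok for tests in results.values() for ok in tests]
    return sum(outcomes) / len(outcomes)


def conversation_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of conversations in which every single test passed."""
    return sum(all(tests) for tests in results.values()) / len(results)


if __name__ == "__main__":
    print(f"per-test accuracy:     {per_test_accuracy(RESULTS):.2f}")
    print(f"conversation accuracy: {conversation_accuracy(RESULTS):.2f}")
```

With this toy data, 8 of 10 individual tests pass but only 1 of 3 conversations is fully correct, which shows how small per-step error rates compound over a multi-turn conversation.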