SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Shirley Kokane, Ming Zhu, Tulika Awalgaonkar, Jianguo Zhang, Thai Hoang, Akshara Prabhakar, Zuxin Liu, Tian Lan, Liangwei Yang, Juntao Tan, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese
2024-11-21
Abstract: Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a shortcoming of existing benchmark environments for evaluating large language models (LLMs) on tool-use tasks. Existing benchmarks usually report only the proportion of tasks completed successfully, without analyzing failure cases or giving detailed feedback on error patterns. This makes it difficult for researchers to understand the root causes of model errors, which limits their ability to improve model performance.

To address this challenge, the authors introduce a new benchmark named SpecTool. SpecTool aims to identify and characterize seven common error patterns in LLM tool use and to give researchers detailed diagnostic feedback that helps them develop more effective error-mitigation strategies. SpecTool contains queries from 10 environment categories covering more than 30 tool-specific tasks, making it one of the most comprehensive evaluation environments currently available.

### Main Contributions

1. **Introducing the SpecTool Benchmark**: Covering 10 environment categories and more than 30 tasks, it is designed specifically to provide detailed diagnostic feedback for tool-use tasks.
2. **Identifying Seven Common Error Patterns**: A consistent evaluation framework is created to analyze these error patterns, ensuring consistent evaluation across different agents and environments.
3. **Providing a Human-Annotated Dataset**: It contains 150 queries for detecting error patterns in LLM outputs, and case studies demonstrate the effectiveness of SpecTool.

### Error Patterns

The paper describes the following seven common error patterns:

- **Insufficient API Calls (IAC)**: The LLM fails to generate enough API calls to fully complete the task.
- **Incorrect Argument Value (IAV)**: The LLM generates incorrect argument values, including omitting necessary arguments.
- **Incorrect Argument Name (IAN)**: The LLM generates argument names that do not exist.
- **Incorrect Argument Type (IAT)**: The LLM generates arguments of the wrong type.
- **Repeated API Calls (RAC)**: The LLM generates the same API call more than once, resulting in redundant calls.
- **Incorrect Function Name (IFN)**: The LLM generates function names that are not in the API list.
- **Invalid Format Error (IFE)**: The LLM fails to follow the provided format instructions, so its output cannot be parsed.

Through these contributions, SpecTool provides a more in-depth evaluation of model performance and offers guidance for future LLM development and optimization.
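To make the seven error patterns concrete, here is a minimal Python sketch of how they might be detected by comparing a model's output against an API specification and a set of annotated gold calls. It is an illustration only, not SpecTool's evaluation code. The JSON call format, the `API_SPEC` table, and the functions `parse_calls` and `classify_errors` are all hypothetical, and the IAC check (fewer distinct calls than gold calls) is a deliberate simplification.

```python
import json

# Hypothetical API list: function name -> {argument name: expected Python type}.
API_SPEC = {
    "get_weather": {"city": str, "days": int},
    "book_flight": {"origin": str, "destination": str},
}


def parse_calls(raw):
    """Parse output as a JSON list of {"name": str, "arguments": dict}.

    Returns None when the output does not follow this (assumed) format.
    """
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list):
        return None
    for call in calls:
        if not (
            isinstance(call, dict)
            and isinstance(call.get("name"), str)
            and isinstance(call.get("arguments"), dict)
        ):
            return None
    return calls


def classify_errors(raw, gold, spec=API_SPEC):
    """Return the set of error-pattern labels found in one model output."""
    calls = parse_calls(raw)
    if calls is None:
        return {"IFE"}  # Invalid Format Error: output could not be parsed

    errors = set()
    distinct = []
    for call in calls:
        if call in distinct:
            errors.add("RAC")  # Repeated API Call
            continue
        distinct.append(call)

        name, args = call["name"], call["arguments"]
        if name not in spec:
            errors.add("IFN")  # Incorrect Function Name
            continue

        params = spec[name]
        well_formed = True
        for arg, value in args.items():
            if arg not in params:
                errors.add("IAN")  # Incorrect Argument Name
                well_formed = False
            elif not isinstance(value, params[arg]):
                errors.add("IAT")  # Incorrect Argument Type
                well_formed = False

        # Names and types are valid, but no gold call with this function
        # name has the same arguments (wrong or missing values).
        gold_same_fn = [g for g in gold if g["name"] == name]
        if well_formed and gold_same_fn and all(
            g["arguments"] != args for g in gold_same_fn
        ):
            errors.add("IAV")  # Incorrect Argument Value

    if len(distinct) < len(gold):
        errors.add("IAC")  # Insufficient API Calls (simplified count check)
    return errors


gold_calls = [{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}]
output = '[{"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}]'
print(classify_errors(output, gold_calls))
```

In this usage example the `days` argument is a string rather than an integer, so the sketch reports the IAT pattern. Returning a set of labels rather than a single pass/fail flag mirrors the summary's point that a success rate alone does not explain why a tool call failed.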