Abstract: Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.

What problem does this paper attempt to address?

This paper addresses a gap in existing benchmarks for evaluating large language models (LLMs) on tool-use tasks. Existing benchmarks usually report only the proportion of tasks completed successfully, without analyzing failure cases or giving detailed feedback on error patterns. This makes it hard for researchers to understand the root causes of model errors, which limits their ability to improve model performance.

To address this, the authors introduce a new benchmark named SpecTool. SpecTool aims to identify and characterize seven common error patterns in LLM tool use and to give researchers detailed diagnostic feedback for developing more effective error-mitigation strategies. It contains queries from 10 environment categories covering more than 30 tool-specific tasks, making it one of the most comprehensive evaluation environments currently available.

### Main Contributions

1. **Introducing the SpecTool Benchmark**: Covering 10 environment categories and more than 30 tasks, it is designed specifically to provide detailed diagnostic feedback for tool-use tasks.
2. **Identifying Seven Common Error Patterns**: A consistent evaluation framework analyzes these error patterns, so that results are comparable across different agents and environments.
3. **Providing a Human-Annotated Dataset**: It contains 150 queries for detecting error patterns in LLM outputs, and case studies demonstrate the effectiveness of SpecTool.

### Error Patterns

The paper describes the following seven common error patterns:

- **Insufficient API Calls (IAC)**: The LLM fails to generate enough API calls to fully complete the task.
- **Incorrect Argument Value (IAV)**: The LLM generates incorrect argument values, including omitting necessary arguments.
- **Incorrect Argument Name (IAN)**: The LLM generates non-existent argument names.
- **Incorrect Argument Type (IAT)**: The LLM generates argument values of the wrong type.
- **Repeated API Calls (RAC)**: The LLM generates the same API call more than once, resulting in redundant calls.
- **Incorrect Function Name (IFN)**: The LLM generates function names that are not in the API list.
- **Invalid Format Error (IFE)**: The LLM does not follow the provided format instructions, so its output cannot be parsed.

Through these contributions, SpecTool provides a more in-depth evaluation of model performance and offers guidance for future LLM development and optimization.
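
To make the seven error patterns concrete, the following is a minimal sketch of how one model output could be checked against them. It is not SpecTool's evaluator: the JSON call format, the `API_SPEC` schema, the function `detect_error_patterns`, and the example tool names are all hypothetical and chosen only for illustration.

```python
import json
from collections import Counter

# Hypothetical API list: function name -> {argument name: expected Python type}.
API_SPEC = {
    "get_weather": {"city": str, "days": int},
    "book_flight": {"origin": str, "destination": str},
}


def detect_error_patterns(raw_output, gold_calls, api_spec=API_SPEC):
    """Return the set of error-pattern codes found in one model output.

    raw_output: model text, assumed to be a JSON list of
                {"name": ..., "arguments": {...}} objects.
    gold_calls: reference calls in the same structure.
    Simplification: each function name appears at most once in gold_calls.
    """
    # IFE: the output cannot be parsed in the requested format.
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"IFE"}
    well_formed = isinstance(calls, list) and all(
        isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and isinstance(c.get("arguments"), dict)
        for c in calls
    )
    if not well_formed:
        return {"IFE"}

    errors = set()

    # RAC: an identical call (same name and arguments) appears more than once.
    seen = Counter(json.dumps(c, sort_keys=True) for c in calls)
    if any(count > 1 for count in seen.values()):
        errors.add("RAC")

    # IAC: fewer distinct calls than the reference requires.
    if len(seen) < len(gold_calls):
        errors.add("IAC")

    gold_by_name = {g["name"]: g["arguments"] for g in gold_calls}
    for call in calls:
        name, args = call["name"], call["arguments"]
        if name not in api_spec:
            errors.add("IFN")  # function is not in the API list
            continue
        schema = api_spec[name]
        gold_args = gold_by_name.get(name, {})
        for arg, value in args.items():
            if arg not in schema:
                errors.add("IAN")  # argument name does not exist
            elif not isinstance(value, schema[arg]):
                errors.add("IAT")  # value has the wrong type
            elif arg in gold_args and value != gold_args[arg]:
                errors.add("IAV")  # right name and type, wrong value
        # Omitted required arguments are counted as IAV, as in the list above.
        if any(arg not in args for arg in gold_args):
            errors.add("IAV")
    return errors


gold = [{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}]
output = '[{"name": "get_weather", "arguments": {"town": "Paris", "days": "3"}}]'
print(sorted(detect_error_patterns(output, gold)))  # ['IAN', 'IAT', 'IAV']
```

A single output can trigger several patterns at once, which is why the sketch returns a set rather than one label. Counting these codes across a dataset gives a per-pattern breakdown instead of a single success rate.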

SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

On the Tool Manipulation Capability of Open-source Large Language Models

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Tools Fail: Detecting Silent Errors in Faulty Tools

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

ToolQA: A Dataset for LLM Question Answering with External Tools

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Efficient and Scalable Estimation of Tool Representations in Vector Space

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types

Towards Tool Use Alignment of Large Language Models