Abstract: Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
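
The depth-first search over candidate solution paths mentioned in the abstract can be illustrated with a minimal sketch. This is a generic depth-first search with backtracking, not the paper's actual algorithm: the `propose` callback stands in for an LLM suggesting next API calls, and `TOY_TREE`, `toy_propose`, and all API names below are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def dfs_solution_path(
    state: str,
    propose: Callable[[str], Sequence[str]],
    is_solved: Callable[[str], bool],
    max_depth: int = 4,
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Return the first chain of states that reaches a solved state, or None."""
    path = (path or []) + [state]
    if is_solved(state):
        return path
    if len(path) > max_depth:
        return None  # abandon this branch so the caller can backtrack
    for child in propose(state):  # candidates are tried in the order proposed
        found = dfs_solution_path(child, propose, is_solved, max_depth, path)
        if found is not None:
            return found
    return None  # every child failed: backtrack to the parent


# Hypothetical toy tree; in a real system an LLM would propose these steps.
TOY_TREE = {
    "start": ["search_flights", "get_weather"],
    "search_flights": ["book_flight"],  # dead end in this toy example
    "get_weather": ["recommend_city"],
    "recommend_city": ["finish"],
}


def toy_propose(state: str) -> Sequence[str]:
    return TOY_TREE.get(state, [])


result = dfs_solution_path("start", toy_propose, lambda s: s == "finish")
print(result)
```

In this toy run the search first follows `search_flights`, hits a dead end, and backtracks to `get_weather`. A single linear reasoning trace would have stopped at the first failure, which is the motivation for exploring several branches.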

What problem does this paper attempt to address?

The paper addresses the limited tool-use capabilities of open-source large language models (LLMs). Although open-source LLMs such as LLaMA have made significant progress on basic language tasks, their ability to interact with external tools (APIs) to complete complex tasks remains limited. This contrasts sharply with the tool-use performance of state-of-the-art closed-source LLMs such as ChatGPT. Specifically, the paper points out:

1. **Insufficient tool-use capabilities**: Current instruction tuning focuses mainly on basic language tasks and ignores the tool-use domain. As a result, open-source LLMs perform poorly on complex tasks that require calling multiple APIs.
2. **Limitations of existing work**:
   - **Limited number of APIs**: Existing work either does not involve real-world APIs or involves only a small number of APIs with insufficient diversity and coverage.
   - **Limited scenarios**: Existing work focuses mainly on single-tool use and ignores real-world scenarios in which multiple tools need to work together.
   - **Insufficient planning and reasoning capabilities**: The reasoning methods adopted in existing work (such as CoT or ReAct) cannot fully realize the potential of LLMs, especially when handling complex instructions.

To solve these problems, the paper introduces ToolLLM, a general tool-use framework covering data construction, model training, and evaluation. Its contributions are:

1. **ToolBench dataset**: An automatically constructed, high-quality instruction-tuning dataset built on 16,464 real-world RESTful APIs spanning 49 categories. Construction is divided into three stages: API collection, instruction generation, and solution path annotation.
2. **Enhanced reasoning capabilities**: A depth-first search-based decision tree algorithm (DFSDT) that expands the search space and improves the model's reasoning on complex tasks.
3. **Automatic evaluation tool**: ToolEval, an automatic evaluator of LLM tool-use capabilities built around two key metrics: pass rate and win rate.
4. **Neural API retriever**: A trained neural retriever that recommends relevant APIs for a given instruction, reducing the need for manual API selection.

Through these contributions, the paper aims to improve the tool-use performance of open-source LLMs, enabling them to handle complex tasks and to generalize zero-shot, for example to unseen APIs. Experimental results show that ToolLLaMA performs well on both single-tool and multi-tool instructions, and that its performance is close to or even exceeds that of some closed-source models.
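
The two ToolEval metrics named above can be sketched as simple proportions. This is an illustrative assumption rather than the evaluator's actual implementation: pass rate is taken here as the fraction of instructions judged successfully completed, and win rate as the fraction of instructions on which the candidate's solution is preferred over a reference solution. The `Judgment` class and the hard-coded verdicts are hypothetical; in practice the verdicts would come from an automatic evaluator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Judgment:
    """One evaluator verdict for one instruction (hypothetical structure)."""
    passed: bool           # was the instruction judged as completed?
    beats_reference: bool  # was the candidate preferred over the reference?


def pass_rate(judgments: Sequence[Judgment]) -> float:
    """Fraction of instructions judged as successfully completed."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(j.passed for j in judgments) / len(judgments)


def win_rate(judgments: Sequence[Judgment]) -> float:
    """Fraction of instructions where the candidate beat the reference."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(j.beats_reference for j in judgments) / len(judgments)


# Made-up verdicts for four instructions, for illustration only.
verdicts = [
    Judgment(passed=True, beats_reference=True),
    Judgment(passed=True, beats_reference=False),
    Judgment(passed=False, beats_reference=False),
    Judgment(passed=True, beats_reference=True),
]

print(pass_rate(verdicts), win_rate(verdicts))
```

Keeping the two metrics separate matters because a model can complete an instruction (a pass) while still producing a worse solution path than the reference (a loss).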

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation

Large Language Models as Tool Makers

Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees

Towards Tool Use Alignment of Large Language Models

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

On the Tool Manipulation Capability of Open-source Large Language Models

Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use

Tool Learning with Large Language Models: A Survey

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Towards Practical Tool Usage for Continually Learning LLMs

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models