Abstract: Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator, ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM, ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability on an out-of-distribution tool-use dataset, APIBench.
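The abstract's idea of searching for a solution path (a chain of API calls) depth-first can be illustrated with a minimal sketch. This is not the paper's algorithm or code: it is a generic depth-first search with backtracking, and every name in it (`dfs_solution_path`, `propose`, `apply`, `is_solved`, and the toy API names) is a hypothetical placeholder. In a real system, `propose` would presumably be an LLM suggesting candidate calls and `apply` would execute an actual API.

```python
from typing import Callable, List, Optional, Sequence


def dfs_solution_path(
    state: str,
    propose: Callable[[str], Sequence[str]],
    apply: Callable[[str, str], Optional[str]],
    is_solved: Callable[[str], bool],
    max_depth: int = 4,
) -> Optional[List[str]]:
    """Return the first chain of calls that reaches a solved state, or None."""
    if is_solved(state):
        return []
    if max_depth == 0:
        return None  # depth budget exhausted; the caller backtracks
    for call in propose(state):
        next_state = apply(state, call)
        if next_state is None:
            continue  # the call failed; abandon this branch and try the next
        rest = dfs_solution_path(next_state, propose, apply, is_solved, max_depth - 1)
        if rest is not None:
            return [call] + rest
    return None  # every candidate from this state failed


# Toy stand-ins (assumptions for illustration only, not real APIs).
def toy_propose(state: str) -> Sequence[str]:
    return ["broken_api", "get_city_id", "get_weather"]


def toy_apply(state: str, call: str) -> Optional[str]:
    if call == "get_city_id" and "city_id" not in state:
        return state + "|city_id"
    if call == "get_weather" and "city_id" in state and "weather" not in state:
        return state + "|weather"
    return None  # broken_api, or a call whose precondition is not met


def toy_is_solved(state: str) -> bool:
    return "weather" in state


path = dfs_solution_path("start", toy_propose, toy_apply, toy_is_solved)
print(path)
```

In the toy run, the search skips the failing `broken_api` branch and backtracks until it finds a chain that ends in a solved state. Unlike a single linear attempt, a failed call does not end the search, which is the sense in which such a procedure considers multiple reasoning traces.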

AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach

LLMAuditor: A Framework for Auditing Large Language Models Using Human-in-the-Loop

Auditing large language models: a three-layered approach

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Supporting Human-AI Collaboration in Auditing LLMs with LLMs

AuditWen: An Open-Source Large Language Model for Audit

ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning

LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Assessing the Reliability of Large Language Model Knowledge

LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations

DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection

Apprentices to Research Assistants: Advancing Research with Large Language Models

Evaluating LLMs at Detecting Errors in LLM Responses

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent