Abstract: Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., covering only limited tool-use scenes) and (2) high evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth without using any GPT or human evaluation metrics. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios, and we also propose an instruction dataset called MTU-Instruct data to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.

What problem does this paper attempt to address?

This paper addresses several major limitations of existing benchmark datasets for evaluating the tool-use capabilities of large language models (LLMs):

1. **Insufficient evaluation scenarios**: Existing tool-use benchmarks cover only a limited range of tool-use scenarios, so they cannot comprehensively evaluate model performance in complex situations.
2. **High evaluation cost**: Many existing benchmarks rely on external APIs (such as GPT) for evaluation, which makes evaluation expensive.
3. **Lack of multi-turn dialogue scenarios**: Some benchmarks do not consider multi-turn dialogues, which limits evaluation of a model's ability to handle continuous interactions in practical applications.
4. **Insufficient multi-tool scenarios**: Some benchmarks do not fully cover scenarios that require multiple tools, which weakens the evaluation of a model's overall tool-use capabilities.
5. **Mismatch between synthetic instructions and actual requirements**: The synthetic instructions used in some benchmarks often fail to reflect the requirements of real-world users.
6. **Insufficient fine-grained evaluation**: Many benchmarks do not evaluate fine-grained aspects of tool use, such as the accuracy of tool-calling sequences and complex relationships between tool calls.

To overcome these limitations, the authors propose a multi-granularity tool-use benchmark (MTU-Bench) with the following characteristics:

- **Multi-granularity tool-use scenarios**: It covers five tool-use scenarios: single-turn single-tool, single-turn multi-tool, multi-turn single-tool, multi-turn multi-tool, and out-of-distribution tasks.
- **Automatic evaluation**: All evaluation metrics are computed from the prediction results and the ground-truth labels, without GPT-based or human evaluation.
- **Real-world scenario simulation**: The benchmark simulates real-world tool-use scenarios by transforming existing high-quality datasets.
- **Enhanced instruction dataset**: The authors also propose an instruction dataset, MTU-Instruct, to enhance the tool-use capabilities of existing LLMs.

With these properties, MTU-Bench aims to provide a more comprehensive, lower-cost, and more practical framework for evaluating tool use, and thereby to support research on and development of tool-using LLMs.
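To illustrate what evaluation "based on prediction results and ground truth" can look like in practice, here is a minimal Python sketch. It is not the paper's implementation: the `ToolCall` schema, the two metric functions, and the example tool names (`search_flights`, `book_hotel`) are all hypothetical and chosen only to show how tool-call predictions can be scored by direct comparison with reference calls, without a GPT judge or human rater.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One tool invocation (hypothetical schema): a tool name plus its arguments."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def tool_selection_accuracy(predicted: List[ToolCall], reference: List[ToolCall]) -> float:
    """Fraction of aligned positions where the predicted tool name matches the reference.

    Dividing by the longer of the two lists penalizes both missing and extra calls.
    """
    if not predicted and not reference:
        return 1.0
    hits = sum(1 for p, r in zip(predicted, reference) if p.name == r.name)
    return hits / max(len(predicted), len(reference))


def exact_call_accuracy(predicted: List[ToolCall], reference: List[ToolCall]) -> float:
    """Fraction of aligned positions where both the tool name and all arguments match."""
    if not predicted and not reference:
        return 1.0
    # Dataclass equality compares the name and the full arguments dict.
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / max(len(predicted), len(reference))


# Hypothetical single-turn, multi-tool example.
reference = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_hotel", {"city": "New York"}),
]
predicted = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_hotel", {"city": "Boston"}),  # right tool, wrong argument
]

print(tool_selection_accuracy(predicted, reference))  # 1.0
print(exact_call_accuracy(predicted, reference))      # 0.5
```

Because both scores are deterministic functions of the predicted and reference calls, they can be recomputed at no API cost. Separating tool selection from exact-call matching is one simple way to obtain the kind of finer-grained signal the summary describes.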

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
