MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
2024-10-15
Abstract: Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., only cover limited tool-use scenes); (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth without using any GPT or human evaluation metrics. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios, and we also propose an instruction dataset called MTU-Instruct data to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
Computation and Language
What problem does this paper attempt to address?
This paper addresses several major limitations of existing benchmark datasets for evaluating the tool-use capabilities of large language models (LLMs):

1. **Insufficient evaluation scenarios**: Existing tool-use benchmarks cover only a limited range of tool-use scenarios, so they cannot comprehensively evaluate model performance across varied and complex situations.
2. **High evaluation cost**: Many existing benchmarks rely on external APIs (such as GPT) for evaluation, which makes evaluation expensive.
3. **Lack of multi-turn dialogue scenarios**: Some benchmarks do not consider multi-turn dialogue, which limits evaluation of a model's ability to handle continuous interactions in practical applications.
4. **Insufficient multi-tool scenarios**: Some benchmarks do not fully cover multi-tool use, which weakens evaluation of a model's overall tool-use capabilities.
5. **Mismatch between synthetic instructions and actual requirements**: The synthetic instructions used in some benchmarks often do not accurately reflect what real-world users need.
6. **Insufficient fine-grained evaluation**: Many benchmarks do not evaluate fine-grained aspects of tool use, such as the accuracy of tool-calling sequences or complex relationships between tool calls.

To overcome these limitations, the authors propose a multi-granularity tool-use benchmark (MTU-Bench) with the following characteristics:

- **Multi-granularity tool-use scenarios**: It covers single-turn single-tool, single-turn multi-tool, multi-turn single-tool, multi-turn multi-tool, and out-of-distribution tasks.
- **Automatic evaluation**: All evaluation metrics are computed from prediction results and ground-truth labels, without GPT-based or human evaluation.
- **Real-world scenario simulation**: Real-world tool-use scenarios are simulated by transforming existing high-quality datasets.
- **Instruction dataset**: A dataset named MTU-Instruct is proposed to enhance the tool-use capabilities of existing LLMs.

With these features, MTU-Bench aims to provide a more comprehensive, lower-cost, and more practical framework for evaluating tool use, and thereby to support research and development on tool use in LLMs.
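To make the idea of GPT-free evaluation concrete, here is a minimal Python sketch of how predicted tool calls might be compared against ground truth and how a sample might be assigned to one of the four in-distribution granularities. It is an illustration only: the `ToolCall` schema, the helper names (`make_call`, `scene`, `tool_name_match`, `exact_call_match`), and the example tools are all hypothetical and are not taken from the MTU-Bench code or its actual metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical schema, not the MTU-Bench format)."""
    name: str
    arguments: tuple = ()  # sorted (key, value) pairs, so order of kwargs is irrelevant


def make_call(name, **arguments):
    """Build a ToolCall with arguments normalized into sorted pairs."""
    return ToolCall(name, tuple(sorted(arguments.items())))


def scene(num_turns, num_tools):
    """Label a sample by granularity, e.g. 'single-turn multi-tool'."""
    turn = "single-turn" if num_turns == 1 else "multi-turn"
    tool = "single-tool" if num_tools == 1 else "multi-tool"
    return f"{turn} {tool}"


def tool_name_match(predicted, gold):
    """True when the predicted tool names equal the gold names, in order."""
    return [c.name for c in predicted] == [c.name for c in gold]


def exact_call_match(predicted, gold):
    """True when every predicted call matches the gold call in name and arguments."""
    return list(predicted) == list(gold)


# Hypothetical sample: the model picks the right tools but one wrong argument.
gold = [make_call("get_weather", city="Paris"),
        make_call("book_hotel", city="Paris", nights=2)]
pred = [make_call("get_weather", city="Paris"),
        make_call("book_hotel", nights=3, city="Paris")]

print(scene(num_turns=1, num_tools=len(gold)))  # single-turn multi-tool
print(tool_name_match(pred, gold))              # True
print(exact_call_match(pred, gold))             # False
```

Separating a name-level check from an exact-call check is one simple way to obtain a finer-grained signal than a single pass/fail score; both checks depend only on the prediction and the reference, so no external model is needed at evaluation time.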