WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu
2024-07-02
Abstract: Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, where six of them are tool-usage datasets, and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and LLMs' performance in tool-usage datasets improves when their ability is similar to ChatGPT. In both datasets, incorrect tool usage significantly impairs LLMs' performance. To mitigate this, we also develop the finetuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. We will release the WTU-Eval benchmark.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how large language models (LLMs) can flexibly judge whether to use external tools in practical applications. Most current research assumes that LLMs must use tools, which is inconsistent with real-world situations: in practice, the necessity of tools is uncertain, and inappropriate or unnecessary tool use damages the overall performance of LLMs. The paper therefore proposes a new benchmark, WTU-Eval, to evaluate whether LLMs can recognize their own ability boundaries and use tools flexibly as needed. Specifically, the goals of the paper include:

1. **Explore the ability boundaries of LLMs**: Investigate whether LLMs can recognize when to use tools and when not to.
2. **Evaluate the tool-use decision-making ability of LLMs**: Use the WTU-Eval benchmark to evaluate LLMs on different datasets, with a focus on their tool-use decisions.
3. **Improve the tool-use ability of LLMs**: Develop a fine-tuning dataset to strengthen LLMs' tool-use decision-making.

### Main contributions of the paper

1. **Propose the WTU-Eval benchmark**: This is the first benchmark specifically designed to evaluate whether LLMs can accurately judge whether to use tools.
2. **Evaluate the performance of multiple well-known LLMs**: Conduct a rigorous evaluation of eight well-known LLMs and point out their limitations in tool-use decision-making.
3. **Develop a fine-tuning dataset**: Based on the training set of the WTU-Eval benchmark, develop a fine-tuning dataset containing 4,000 samples to enhance the model's tool-use decision-making. Experimental results show that the fine-tuned Llama2-7B model achieves an average performance improvement of 14% and a 16.8% reduction in incorrect tool usage.
### Experimental setup

- **Dataset**: The WTU-Eval benchmark includes six tool-usage datasets and five general-purpose datasets.
- **Tool pool**: Includes machine translators, calculators, search engines, and Wikipedia search.
- **Evaluation metrics**: Mainly focus on accuracy, use advanced methods for evaluation, and introduce the call rate to balance the comparison.

### Experimental results

- **Tool-usage datasets**: When LLMs can decide whether to use tools and their abilities are close to ChatGPT's, their performance on tool-usage datasets improves. However, tool complexity affects performance. For example, the complex WolframAlpha calculator causes a significant decline in the performance of Llama2.
- **General-purpose datasets**: When LLMs can decide whether to use tools, their performance on general-purpose datasets generally declines, mainly because LLMs often misuse tools.
- **Fine-tuning effect**: With the fine-tuning dataset, Llama2-7B improves significantly in both performance and tool-use decision-making.

### Discussion

- **Error analysis**: The paper analyzes failure cases in depth and finds that the main error types include incorrect or unnecessary tool calls, empty content, calling the correct tool but not performing reasoning, repeatedly calling invalid tools, and getting stuck in an infinite retry loop.
- **Impact of the fine-tuning method**: Through supervised fine-tuning (SFT), the performance of Llama2-7B on general-purpose datasets improves significantly, especially through a lower rate of incorrect tool calls.

In general, through the WTU-Eval benchmark and the fine-tuning dataset, this paper provides new ideas and methods for improving the ability of LLMs in tool-use decision-making.
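
The two metrics named in the experimental setup, accuracy and call rate, can be illustrated with a minimal sketch. The summary does not give the paper's exact definitions, so this assumes accuracy is a case-insensitive exact match and call rate is the fraction of samples on which the model invoked any tool. The `Record` schema, the `summarize` helper, and the tool labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """One evaluated sample (hypothetical schema, not the paper's format)."""
    dataset_kind: str           # "tool" or "general"
    predicted: str              # the model's final answer
    gold: str                   # the reference answer
    tool_called: Optional[str]  # name of the invoked tool, or None if no tool was used


def accuracy(records: List[Record]) -> float:
    """Assumed metric: case-insensitive exact match against the reference."""
    if not records:
        return 0.0
    hits = sum(r.predicted.strip().lower() == r.gold.strip().lower() for r in records)
    return hits / len(records)


def call_rate(records: List[Record]) -> float:
    """Assumed metric: share of samples on which any tool was invoked."""
    if not records:
        return 0.0
    return sum(r.tool_called is not None for r in records) / len(records)


def summarize(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Report both metrics separately for tool-usage and general datasets."""
    report = {}
    for kind in ("tool", "general"):
        subset = [r for r in records if r.dataset_kind == kind]
        report[kind] = {
            "n": len(subset),
            "accuracy": accuracy(subset),
            "call_rate": call_rate(subset),
        }
    return report


# Toy data: tool labels are illustrative placeholders.
records = [
    Record("tool", "42", "42", "calculator"),
    Record("tool", "7", "8", None),
    Record("general", "Paris", "paris", None),
    Record("general", "Rome", "Madrid", "search_engine"),
]
report = summarize(records)
print(report)
```

Reporting the call rate next to accuracy, split by dataset type, makes it visible when a drop on general-purpose datasets coincides with the model calling tools it did not need.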