Abstract: Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and correctly invoke them to meet user requirements. However, previous works have observed that the performance of tool learning varies across tasks, datasets, training settings, and algorithms. Failing to understand the impact of these factors can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we reach several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores how stable the performance of tool-learning frameworks is across different tasks, datasets, training settings, and algorithms. Specifically, through extensive experiments, the researchers systematically analyze the influence of internal factors (such as the choice of base language model, decoding temperature, and maximum inference steps) and external factors (such as user query style, system prompts, and the order and scale of the candidate tool set) on the performance of tool-learning frameworks. The core objective is to reveal how these factors affect the stability and effectiveness of tool-learning models and to provide a new perspective for future tool-learning research.

### Main findings

1. **Existing tool-use workflows are clearly unstable**: Even the most advanced methods show instability when faced with slight perturbations.
2. **The influence of internal factors**:
   - **Hyperparameter settings**: Appropriate hyperparameter settings can improve the ability of LLMs to generate diverse solutions, but may also lead to instability.
   - **Base language model selection**: Closed-source models are significantly better than open-source models in terms of success rate and invalid selection rate, and both performance and stability improve as the parameter scale increases.
3. **The influence of external factors**:
   - **The order and scale of the candidate tool set**: LLMs are very sensitive to changes in the order of the candidate tools, especially with large tool sets, where they are prone to repeatedly selecting useless tools.
   - **System prompts**: Optimized system prompts can significantly improve model performance, but different prompt strategies lead to different results.
   - **User behavior**: Although LLMs are relatively stable whether users give concise or detailed descriptions, they still show some instability when dealing with large candidate tool sets.

### Experimental design

1. **Datasets**: The researchers used two subsets of the widely used ToolBench benchmark dataset. Each subset contains 200 tasks involving various practical applications.
2. **Evaluation metrics**: Success rate, give-up rate, invalid selection rate, and the t-test.
3. **Tool-learning frameworks**: Two frameworks, ReAct and DFSDT, were mainly used for comparison.

### Conclusion

Through systematic empirical research, the paper reveals how tool-learning frameworks perform, and how unstable that performance is, under different factors. The results provide an important reference for future tool-learning research and applications, especially for optimizing the tool-selection module in practical scenarios to improve the stability and effectiveness of the model.
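The evaluation metrics named in the experimental design can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `Run` record layout, the definition of invalid selection rate as the share of tool calls that name a tool outside the candidate set, and the use of Welch's unequal-variance t statistic (the summary does not say which t-test variant the paper uses). It is not the paper's evaluation code.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, variance


@dataclass
class Run:
    """Outcome of one task attempt (hypothetical record layout)."""
    solved: bool        # the task ended with an accepted answer
    gave_up: bool       # the model explicitly abandoned the task
    tool_calls: int     # total tool selections made during the attempt
    invalid_calls: int  # selections naming a tool outside the candidate set


def summarize(runs):
    """Compute the three rates over a non-empty list of runs."""
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "success_rate": sum(r.solved for r in runs) / n,
        "give_up_rate": sum(r.gave_up for r in runs) / n,
        # Assumed definition: invalid selections per tool call.
        "invalid_selection_rate": (
            sum(r.invalid_calls for r in runs) / total_calls
            if total_calls else 0.0
        ),
    }


def welch_t(a, b):
    """Welch's t statistic for two samples, e.g. per-seed success rates."""
    return (mean(a) - mean(b)) / sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


# Made-up example data, not results from the paper.
runs = [
    Run(solved=True, gave_up=False, tool_calls=4, invalid_calls=0),
    Run(solved=False, gave_up=True, tool_calls=6, invalid_calls=2),
    Run(solved=True, gave_up=False, tool_calls=3, invalid_calls=0),
    Run(solved=False, gave_up=False, tool_calls=7, invalid_calls=1),
]
stats = summarize(runs)

# Compare success rates from two hypothetical settings (three seeds each).
t_stat = welch_t([0.50, 0.54, 0.52], [0.40, 0.44, 0.42])
```

A large absolute `t_stat` suggests that the gap between two settings (for example, the original and a shuffled candidate tool order) is unlikely to be seed noise. Turning it into a p-value would additionally require the Welch degrees of freedom and a t distribution, which a statistics library would normally provide.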

What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks
