What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks

Chengrui Huang, Zhengliang Shi, Yuntao Wen, Xiuying Chen, Peng Han, Shen Gao, Shuo Shang
2024-07-03
Abstract: Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and correctly invoke them to meet user requirements. However, previous works have observed that the performance of tool learning varies across tasks, datasets, training settings, and algorithms. Failing to understand the impact of these factors can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we reach several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines how stable the performance of tool-learning frameworks is across different tasks, datasets, training settings, and algorithms. Through extensive experiments, the researchers systematically analyze the influence of internal factors (such as the choice of base language model, the decoding temperature, and the maximum number of inference steps) and external factors (such as user query style, system prompts, and the order and scale of the candidate tool set) on framework performance. The core objective is to reveal how these factors affect the stability and effectiveness of tool-learning models and to provide a new perspective for future tool-learning research.

### Main findings

1. **Existing tool-use workflows are noticeably unstable**: Even the most advanced methods still show instability when faced with slight perturbations.
2. **The influence of internal factors**:
   - **Hyperparameter settings**: Appropriate hyperparameter settings can improve an LLM's ability to generate diverse solutions, but may also lead to instability.
   - **Base language model selection**: Closed-source models are significantly better than open-source models in terms of success rate and invalid selection rate, and both performance and stability improve as the parameter scale increases.
3. **The influence of external factors**:
   - **The order and scale of the candidate tool set**: LLMs are very sensitive to changes in the order of the candidate tools, especially in large tool sets, and are prone to repeatedly selecting useless tools.
   - **System prompts**: Optimized system prompts can significantly improve model performance, but different prompt strategies lead to different results.
   - **User behavior**: Although LLMs are relatively stable whether users give concise or detailed descriptions, they still show some instability when dealing with large candidate tool sets.

### Experimental design

1. **Datasets**: The researchers use two subsets of the widely used ToolBench benchmark; each subset contains 200 tasks covering various practical applications.
2. **Evaluation metrics**: Success rate, give-up rate, invalid selection rate, and a t-test.
3. **Tool-learning frameworks**: Two frameworks, ReAct and DFSDT, are used for comparison.

### Conclusion

Through systematic empirical research, the paper characterizes how tool-learning frameworks perform, and how unstable they are, under different factors. The results provide an important reference for future tool-learning research and applications, especially for optimizing the tool-selection module in practical scenarios to improve the stability and effectiveness of the model.
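
To make the evaluation metrics listed under Experimental design more concrete, the following is a minimal sketch of how they might be computed. The summary does not give the paper's exact definitions, so everything here is an assumption for illustration: the `TaskRecord` fields, the per-step definition of invalid selection rate, and the use of Welch's t statistic to compare two settings are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, variance


@dataclass
class TaskRecord:
    """Hypothetical per-task log; field names are illustrative, not from the paper."""
    solved: bool        # task judged as successfully completed
    gave_up: bool       # model explicitly abandoned the task
    tool_calls: int     # total tool-selection steps taken
    invalid_calls: int  # steps that picked a tool outside the candidate set


def success_rate(records):
    """Fraction of tasks solved (assumes a non-empty list)."""
    return sum(r.solved for r in records) / len(records)


def give_up_rate(records):
    """Fraction of tasks the model gave up on (assumes a non-empty list)."""
    return sum(r.gave_up for r in records) / len(records)


def invalid_selection_rate(records):
    """Fraction of tool-selection steps that chose an invalid tool (assumed definition)."""
    total = sum(r.tool_calls for r in records)
    return sum(r.invalid_calls for r in records) / total if total else 0.0


def welch_t(a, b):
    """Welch's t statistic for two independent samples, each with at least two scores."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up records, purely to exercise the functions.
records = [
    TaskRecord(solved=True, gave_up=False, tool_calls=4, invalid_calls=0),
    TaskRecord(solved=False, gave_up=True, tool_calls=6, invalid_calls=2),
    TaskRecord(solved=True, gave_up=False, tool_calls=2, invalid_calls=0),
    TaskRecord(solved=False, gave_up=False, tool_calls=8, invalid_calls=2),
]

print(success_rate(records))            # 0.5
print(give_up_rate(records))            # 0.25
print(invalid_selection_rate(records))  # 0.2

# Success rates from repeated runs under two settings (made-up numbers),
# e.g. the original versus a shuffled candidate tool order.
print(welch_t([0.5, 0.6, 0.7], [0.3, 0.4, 0.5]))
```

A large absolute t value suggests that the gap between the two settings is unlikely to be run-to-run noise; converting it to a p-value requires the t distribution, which a statistics library would normally supply.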