Abstract: Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning.

What problem does this paper attempt to address?

This paper addresses the problems that large language models (LLMs) encounter when calling external tools and APIs in dynamic environments. Specifically, it focuses on tool variability caused by changes in the external environment. This variability shows up as changes in API names, parameters, or response formats, which prevent LLMs from calling tools correctly in practice and so degrade their performance and reliability.

### Main Problems

1. **Tool Variability**: Existing research mainly studies tool learning in static environments and ignores the fact that tools and APIs change over time. As a result, the APIs an LLM has learned may no longer match the APIs actually deployed, and tool calls fail.
2. **Insufficient Adaptability**: Traditional tool-learning methods usually fine-tune LLMs on large-scale tool-use data and then, at inference time, provide tool manuals or a small number of usage examples in the prompt. This approach performs poorly under tool variability because the tools described in the prompt may be obsolete.

### Solutions

To address these problems, the paper proposes a new framework named **ToolEVO**, which enhances the adaptive and reflective abilities of LLMs in the following ways:

1. **Active Exploration**: Use the Monte Carlo Tree Search (MCTS) algorithm to let LLMs actively explore and interact within dynamic environments.
2. **Self-Reflection and Self-Update**: Based on environmental feedback, LLMs reflect on and update their tool usage on their own, so that they adapt better to tool variability.
3. **New Benchmark**: Construct a new benchmark dataset, **ToolQA-D**, for evaluating the impact of tool variability.

### Specific Methods

- **State Representation**: The state of the current environment includes the task description, the available API usage, and all actions and environmental feedback along the search path.
- **Action Definition**: An action is defined as an API call, including text analysis, tool call, and environmental feedback.
- **Dynamic Environment**: The environment provides feedback such as task completion status, API responses, and error messages.
- **MCTS Operations**:
  - **Selection**: Use the PUCT algorithm to traverse from the root node to a leaf node and select the most promising node for exploration.
  - **Expansion**: After selecting an expandable leaf node, use the LLM to generate candidate actions and expand the tree.
  - **Simulation**: Adopt the cache rollback strategy for simulation to improve efficiency.
  - **Backpropagation**: Propagate rewards from the selected leaf node back to the root node to update visit counts and Q-values.

### Experimental Results

- **Static Environment**: Even without fine-tuning for specific tools, ToolEVO significantly outperforms other baseline methods and in some cases is comparable to static supervised fine-tuning (Static-SFT).
- **Dynamic Environment**: Under tool variability, ToolEVO shows stronger adaptability and stability and significantly outperforms other methods.
- **Generalization Ability**: Under completely different tool variability settings, ToolEVO still performs well, demonstrating its generalization ability.

### Contributions

1. **First Research**: The first systematic study of the impact of tool variability on the performance of LLMs.
2. **Adaptive Framework**: The ToolEVO framework, proposed specifically to address the tool variability problem.
3. **New Benchmark**: The ToolQA-D benchmark dataset, constructed to promote further research.
4. **Experimental Verification**: Extensive experiments that verify the effectiveness and stability of the method.

In conclusion, this paper addresses the tool variability problem that LLMs encounter when calling external tools and APIs in dynamic environments by proposing the ToolEVO framework, which significantly improves the adaptability and reliability of LLMs.
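
The selection, expansion, and backpropagation steps listed under Specific Methods can be illustrated with a minimal sketch. It assumes the standard AlphaZero-style PUCT score, Q + c_puct * P * sqrt(N_parent) / (1 + N_child); the summary does not state which PUCT variant, priors, or constants the paper uses. The `Node` class, the function names, and `c_puct` are hypothetical names chosen for illustration, and the simulation step is omitted. This is not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One search-tree node (hypothetical); `action` is the API call that led here."""
    action: Optional[str] = None
    parent: Optional["Node"] = None
    prior: float = 1.0      # assumed prior probability of the action, e.g. from the LLM
    visits: int = 0         # number of times this node has been visited
    q_value: float = 0.0    # running mean of rewards propagated through this node
    children: Dict[str, "Node"] = field(default_factory=dict)


def puct_score(parent: Node, child: Node, c_puct: float = 1.0) -> float:
    """AlphaZero-style PUCT: exploitation term plus prior-weighted exploration term."""
    exploration = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q_value + exploration


def select(root: Node, c_puct: float = 1.0) -> Node:
    """Walk from the root to a leaf, taking the highest-scoring child at each level."""
    node = root
    while node.children:
        node = max(node.children.values(),
                   key=lambda ch, p=node: puct_score(p, ch, c_puct))
    return node


def expand(node: Node, candidates: Dict[str, float]) -> None:
    """Attach candidate actions (action -> prior) as children of a leaf node."""
    for action, prior in candidates.items():
        node.children[action] = Node(action=action, parent=node, prior=prior)


def backpropagate(leaf: Node, reward: float) -> None:
    """Push a reward from the leaf up to the root, updating visits and mean Q-values."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.q_value += (reward - node.q_value) / node.visits
        node = node.parent
```

In a loop built on this sketch, a call that fails with an error from an outdated API would be backpropagated with a low reward. That lowers its Q-value and raises the relative exploration bonus of its siblings, so later selections move toward alternative calls.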

Learning Evolving Tools for Large Language Models
