Abstract: Tool learning aims to enhance and expand the capabilities of large language models (LLMs) with external tools, and it has gained significant attention recently. Current methods have shown that LLMs can effectively handle a certain number of tools through in-context learning or fine-tuning. However, in real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. Tool retrieval is nontrivial due to the following challenges: 1) complex user instructions and tool descriptions; 2) misalignment between tool retrieval and tool usage models. To address the above issues, we propose to enhance tool retrieval with iterative feedback from the large language model. Specifically, we prompt the tool usage model, i.e., the LLM, to provide feedback for the tool retriever model over multiple rounds, which could progressively improve the tool retriever's understanding of instructions and tools and reduce the gap between the two standalone components. We build a unified and comprehensive benchmark to evaluate tool retrieval models. The extensive experiments indicate that our proposed approach achieves advanced performance in both in-domain evaluation and out-of-domain evaluation.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses the challenges that large language models (LLMs) face in tool retrieval. It focuses on two main issues:
1. **Complex user instructions and tool descriptions**:
- In practical applications, user instructions are usually vague and complex, and tool descriptions are also relatively complex. This makes it difficult for tool retrieval models to accurately understand user needs and tool functions.
- Compared with document retrieval, tool retrieval faces greater challenges because the matching rate between user instructions and tool descriptions is lower (as shown in Figure 2).
2. **Misalignment between tool retrieval and tool-use models**:
- Existing methods usually train and deploy the tool retrieval model and the tool-use model separately, so the retrieval model cannot learn which tools are truly useful from the perspective of tool use.
- This separation leaves a gap between the tool retrieval model and the tool-use model, which further degrades tool-use performance.
### Solutions
To address these problems, the paper proposes enhancing tool retrieval with iterative feedback from the LLM. The method has the following parts:
1. **Iterative feedback generation**:
- In each iteration, the LLM provides feedback based on the current retrieval results. The feedback covers its comprehension and assessment of the user instruction and the retrieved tools, together with a refined version of the user instruction.
- Over multiple iterations, this feedback gradually improves the tool retrieval model's understanding of user instructions and tool functions, narrowing the gap between the tool retrieval model and the tool-use model.
2. **Iteration-aware feedback training**:
- During training, a special iteration marker (such as "Iteration t") is prepended to the user instruction so that the tool retrieval model can adapt to instructions that are refined from round to round.
- Using hard negative samples for training helps the model better distinguish similar tool descriptions, thereby improving retrieval accuracy.
3. **Comprehensive benchmark testing**:
- The paper constructs a comprehensive tool retrieval benchmark (TR-bench) that covers real-world application scenarios, including frequently updated tool sets.
- Extensive experiments show that the proposed method performs well in both in-domain and out-of-domain settings.
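
The feedback loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the token-overlap scorer stands in for a trained retriever, `stub_feedback` stands in for the LLM and only refines the instruction (it does not produce the comprehension and assessment parts of the feedback), and the early-stopping rule, the tool names, and all function names are hypothetical. Only the "Iteration t" marker comes from the summary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool set: name -> description.
TOOLS: Dict[str, str] = {
    "convert_currency": "convert an amount between two currencies",
    "get_forecast": "return the weather forecast for a city",
    "search_flights": "search flights between two cities",
}

def overlap_score(query: str, description: str) -> float:
    """Toy relevance score (Jaccard token overlap); a stand-in for a trained retriever."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query: str, tools: Dict[str, str], k: int) -> List[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    ranked = sorted(tools, key=lambda name: overlap_score(query, tools[name]), reverse=True)
    return ranked[:k]

def with_iteration_marker(instruction: str, t: int) -> str:
    """Prepend the iteration marker so a trained retriever could tell the rounds apart."""
    return f"Iteration {t}: {instruction}"

def stub_feedback(instruction: str, retrieved: List[Tuple[str, str]]) -> str:
    """Hypothetical LLM stand-in: adds a clarifying phrase once, then leaves the text alone."""
    hint = "weather forecast"
    return instruction if hint in instruction else f"{instruction} {hint}"

def iterative_retrieval(
    instruction: str,
    tools: Dict[str, str],
    feedback_fn: Callable[[str, List[Tuple[str, str]]], str],
    rounds: int = 3,
    k: int = 2,
) -> Tuple[List[str], str]:
    """Alternate retrieval and feedback; return the last retrieved tools and instruction."""
    current = instruction
    retrieved: List[str] = []
    for t in range(1, rounds + 1):
        retrieved = retrieve(with_iteration_marker(current, t), tools, k)
        refined = feedback_fn(current, [(name, tools[name]) for name in retrieved])
        if refined == current:  # assumed stopping rule: the feedback changed nothing
            break
        current = refined
    return retrieved, current
```

With the instruction "do I need an umbrella in Paris tomorrow" and `k=1`, the first round retrieves `convert_currency` because the only shared token is "an"; after the stub refines the instruction, the second round retrieves `get_forecast`. In a real system the marker would only help if the retriever had been trained on marked instructions, as the summary describes.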
### Experimental results
- **In-domain evaluation**:
- Retrieval methods without fine-tuning (such as BM25 and Ada Embedding) perform poorly. The fine-tuned ToolRetriever does better than these methods, but its performance is still unsatisfactory.
- The proposed method significantly outperforms the other baselines on all evaluation metrics, especially in the multi-tool scenario (I2), which demonstrates its robustness across scenarios.
- **Out-of-domain evaluation**:
- Because tools are updated frequently in the real world, the paper also tests out-of-domain settings. The proposed method performs well across these scenarios, which demonstrates good generalization ability.
- **Ablation study**:
- Even without warm-up training, the proposed method still achieves a high NDCG score, indicating that it does not rely on prior tool-use knowledge.
- Introducing hard negative samples significantly improves the model's ability to discriminate between tools, and joint training helps the model balance new and old knowledge.
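
For readers unfamiliar with the NDCG metric mentioned above, the following is a minimal sketch of NDCG@k with binary relevance (a retrieved tool is either one of the ground-truth tools or it is not). The binary gain, the choice of `k`, and the function names are assumptions made for illustration; the summary does not state which NDCG variant or cutoff the paper reports.

```python
import math
from typing import List, Set

def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1), positions from 1."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_tools: List[str], relevant_tools: Set[str], k: int) -> float:
    """NDCG@k with binary relevance; returns 0.0 when there are no relevant tools."""
    gains = [1.0 if tool in relevant_tools else 0.0 for tool in ranked_tools]
    # The ideal ranking places every relevant tool first (at most k of them count).
    ideal_gains = [1.0] * min(len(relevant_tools), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that puts the single relevant tool first scores 1.0, while placing it second scores 1 / log2(3), roughly 0.63, so the metric rewards retrievers that rank the right tools near the top rather than merely including them.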
### Conclusion
By introducing an iterative feedback mechanism, the paper addresses both the complexity of user instructions and tool descriptions in tool retrieval and the misalignment between tool retrieval models and tool-use models. The experimental results show that the proposed method performs well on tool retrieval tasks in both in-domain and out-of-domain settings, which suggests broad application prospects.