Abstract: Augmenting large language models (LLMs) with external tools enhances their performance across a variety of tasks. However, prior work over-relies on task-specific demonstrations of tool use, which limits generalizability and raises computational cost because it requires many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that generalizes to various tasks requiring tool use without relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLMs) and LLMs, respectively, while leveraging semantic and pattern-based evaluation at both the question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools, and different SLMs. Despite being more efficient, GEAR achieves higher precision in tool grounding than prior strategies based on LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two main problems that large language models (LLMs) face when using external tools:

1. **Dependence on task-specific demonstrations**: Previous work relies heavily on task-specific tool-use demonstrations, which limits generalization and raises computational cost because it requires frequent calls to large-scale LLMs.
2. **Low computational efficiency**: Existing methods call large language models many times during tool selection and execution, resulting in high computational cost and low efficiency.

To overcome these problems, the authors propose **GEAR** (Generalizable and Efficient Tool Resolution), a computationally efficient query-tool grounding algorithm that generalizes to various tasks without task-specific demonstrations. The main features of GEAR include:

- **Efficiency**: Tool grounding is delegated to small language models (SLMs) and execution to large language models (LLMs), which reduces the computational cost.
- **Generalization ability**: Semantic and pattern-based evaluation at both the question and answer levels makes tool grounding general, improving generalization to new tasks, new tools, and different SLMs.
- **Accuracy**: GEAR grounds tools with higher precision than existing LLM-prompting methods, which in turn improves downstream task accuracy.

### Main contributions

1. **Proposed a new query-tool grounding algorithm**: GEAR selects the most appropriate tool by combining semantic similarity and pattern similarity, improving the accuracy and generalization ability of tool grounding.
2. **Improved computational efficiency**: Most of the computation is assigned to small language models, so fewer calls to large language models are needed and the computational cost drops significantly.
3. **Extensive experimental verification**: Experiments were carried out on 14 datasets covering 6 downstream tasks, demonstrating GEAR's strong generalization ability on new tasks, new tools, and different SLMs.

### Experimental results

- **Downstream task performance**: With a tool library containing 4 basic tools, GEAR outperforms all baseline models on four basic tasks. For example, on the open-domain question answering (ODQA) task, the accuracy of GEAR-augmented GPT-J is 24.3% and 6.7% higher than that of the zero-shot and few-shot baselines, respectively.
- **Tool grounding accuracy**: With the tool library expanded to 10 tools, GEAR achieves high tool grounding accuracy, especially on arithmetic and machine translation tasks. On more open-ended natural language processing tasks, such as open-domain question answering and commonsense question answering, GEAR's grounding strategy also generalizes better.

### Conclusion

GEAR significantly improves the performance and computational efficiency of large language models when using external tools through an efficient and generalizable query-tool grounding algorithm, providing a new direction for future research.
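To make the grounding idea summarized above more concrete, here is a minimal Python sketch of tool selection that blends a question-level semantic score with an answer-level pattern score. It is a toy illustration under stated assumptions, not GEAR's actual implementation: the bag-of-words cosine stands in for whatever semantic similarity model is used, and the pattern classes, the `weight` parameter, the `TOOLS` entries, and all function names are hypothetical. The `draft_answer` argument plays the role of a rough answer produced by a small language model; the selected tool would then be handed to the large model for execution.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pattern_signature(text):
    """Count coarse token classes (hypothetical classes, chosen for illustration)."""
    counts = Counter()
    for token in text.split():
        if re.fullmatch(r"[-+]?\d+(\.\d+)?", token):
            counts["number"] += 1
        elif token.isascii():
            counts["ascii_word"] += 1
        else:
            counts["non_ascii"] += 1
    return counts


def grounding_score(query, draft_answer, tool, weight=0.5):
    """Blend question-level semantic similarity with answer-level pattern similarity."""
    semantic = cosine(bag_of_words(query), bag_of_words(tool["description"]))
    pattern = cosine(
        pattern_signature(draft_answer), pattern_signature(tool["sample_output"])
    )
    return weight * semantic + (1 - weight) * pattern


def select_tool(query, draft_answer, tools, weight=0.5):
    """Return the tool with the highest blended grounding score."""
    return max(tools, key=lambda t: grounding_score(query, draft_answer, t, weight))


# Hypothetical two-tool library; descriptions and sample outputs are made up.
TOOLS = [
    {
        "name": "calculator",
        "description": "compute the result of an arithmetic expression such as a sum or product",
        "sample_output": "42",
    },
    {
        "name": "translator",
        "description": "translate a sentence into another language",
        "sample_output": "bonjour le monde",
    },
]

# The draft answer "40" is wrong, but its numeric pattern still points at the calculator.
chosen = select_tool("What is the sum of 12 and 30?", "40", TOOLS)
print(chosen["name"])
```

The sketch shows why a pattern score can help even when the small model's draft answer is wrong: the draft only needs to look like the kind of output the right tool produces, so the expensive model is reserved for the final call.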

GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution
