Abstract: The distinction between humans and animals lies in the unique ability of humans to use and create tools. Tools empower humans to overcome physiological limitations, fostering the creation of magnificent civilizations. Similarly, enabling foundational models like Large Language Models (LLMs) with the capacity to learn external tool usage may serve as a pivotal step toward realizing artificial general intelligence. Previous studies in this field have predominantly pursued two distinct approaches to augment the tool invocation capabilities of LLMs. The first approach emphasizes the construction of relevant datasets for model fine-tuning. The second approach, in contrast, aims to fully exploit the inherent reasoning abilities of LLMs through in-context learning strategies. In this work, we introduce a novel tool invocation pipeline designed to control massive real-world APIs. This pipeline mirrors the human task-solving process, addressing complicated real-life user queries. At each step, we guide LLMs to summarize the achieved results and determine the next course of action. We term this pipeline "from Summary to action", Sum2Act for short. Empirical evaluations of our Sum2Act pipeline on the ToolBench benchmark show significant performance improvements, outperforming established methods like ReAct and DFSDT. This highlights Sum2Act's effectiveness in enhancing LLMs for complex real-world tasks.

What problem does this paper attempt to address?

This paper addresses how to improve the ability of large language models (LLMs) to handle complex real-world tasks by calling open-world APIs. It introduces a tool-calling framework named Sum2Act, which controls large-scale real-world APIs to solve complex user queries. The framework mimics the way humans solve problems: at each step it guides the LLM to summarize the results achieved so far and decide on the next action.

### Main Contributions

1. **A new tool-calling framework**: The framework includes a router and a state manager, enabling the LLM to explicitly monitor task progress and correct errors.
2. **Improved experimental results**: On the ToolBench benchmark, Sum2Act outperforms existing baseline methods such as ReAct and DFSDT, especially on complex real-world tasks.
3. **Extension to visual APIs**: By integrating open-world visual APIs, Sum2Act can also handle more diverse visual tasks.

### Method Overview

- **Overall architecture**: Sum2Act uses an LLM and a wide range of open-world APIs to solve real-world tasks. It first uses a retriever to obtain the tools (or APIs) relevant to the user instruction, and then iterates between an action-proposal stage and a summary stage.
- **Action-proposal stage**: Based on the current state, the instruction, and the available tools, the router plans the next action and executes it. If the task is not yet complete, the router selects a specific tool or API and performs the corresponding operation; if the task is complete, it exits the loop and responds to the user's query.
- **Summary stage**: The state manager evaluates the observation returned by each action and updates the overall state accordingly. It checks whether the new action returned information relevant to the target task. If so, it records the new answer; otherwise, it records the reason for the failure and adds it to the failure history.

### Experimental Results

- **Evaluation on the ToolBench dataset**: Sum2Act performs well on complex tasks, exceeding existing methods on both the Pass Rate and Win Rate metrics.
- **Case studies**: Specific cases demonstrate the effectiveness of Sum2Act, such as successfully obtaining version information for C code compilers, YouTube video information, weather forecasts, and flight data.

### Conclusion

By introducing a new tool-calling framework, Sum2Act improves the ability of LLMs to handle complex real-world tasks. It outperforms existing methods on ToolBench and shows practicality and flexibility in real applications.
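
The loop described in the Method Overview (retrieve tools, then alternate between action proposal and summary) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: in the paper the router and state manager are LLM-driven, whereas here they are replaced by simple rule-based stand-ins. All names (`State`, `retrieve_tools`, `route`, `summarize`, `sum2act`, and the toy `weather` and `flights` tools) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Tool = Callable[[str], str]  # hypothetical tool signature: query in, observation out


@dataclass
class State:
    """Running summary kept by the (stand-in) state manager."""
    answers: List[str] = field(default_factory=list)   # useful results so far
    failures: List[str] = field(default_factory=list)  # failure history
    tried: List[str] = field(default_factory=list)     # tools already called


def retrieve_tools(query: str, registry: Dict[str, Tool]) -> Dict[str, Tool]:
    """Toy retriever (assumption): keep tools whose name occurs in the query."""
    return {name: tool for name, tool in registry.items() if name in query.lower()}


def route(state: State, tools: Dict[str, Tool]) -> Optional[str]:
    """Toy router (assumption): propose the first untried tool, or None to finish."""
    for name in tools:
        if name not in state.tried:
            return name
    return None


def summarize(state: State, name: str, observation: Optional[str],
              error: Optional[str]) -> None:
    """Summary stage: record a new answer, or the reason for the failure."""
    state.tried.append(name)
    if error is None:
        state.answers.append(f"{name}: {observation}")
    else:
        state.failures.append(f"{name}: {error}")


def sum2act(query: str, registry: Dict[str, Tool], max_steps: int = 5) -> State:
    """Alternate between action proposal and summary until the router finishes."""
    tools = retrieve_tools(query, registry)
    state = State()
    for _ in range(max_steps):
        name = route(state, tools)          # action-proposal stage
        if name is None:                    # router decides the task is done
            break
        try:
            observation, error = tools[name](query), None
        except Exception as exc:            # a failed call becomes failure history
            observation, error = None, str(exc)
        summarize(state, name, observation, error)  # summary stage
    return state


def weather(query: str) -> str:
    return "sunny, 21 C"


def flights(query: str) -> str:
    raise RuntimeError("API quota exceeded")


registry = {"weather": weather, "flights": flights}
state = sum2act("weather and flights for Paris", registry)
print(state.answers)   # ['weather: sunny, 21 C']
print(state.failures)  # ['flights: API quota exceeded']
```

Keeping successes and failures in separate fields of the state mirrors the summary stage's distinction between recorded answers and the failure history; a real router would read both before proposing the next action.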

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

Large Language Models as Tool Makers

Learning to Program with Natural Language

MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation

Tool Learning with Large Language Models: A Survey

LLM With Tools: A Survey

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

On the Tool Manipulation Capability of Open-source Large Language Models

Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments

Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

Empowering Large Language Model Agents through Action Learning

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios