ControlLLM: Augment Language Models with Tools by Searching on Graphs

Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

2023-12-18

Abstract: We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a *task decomposer* that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a *Thoughts-on-Graph (ToG) paradigm* that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an *execution engine with a rich toolbox* that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is at https://github.com/OpenGVLab/ControlLLM.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

The paper addresses the difficulties that large language models (LLMs) face when handling complex real-world tasks, especially tasks that require multi-modal tools. Specifically, the research focuses on the following challenges:

1. **Ambiguous user prompts**: The instructions provided by users may not be clear and explicit, making it difficult for LLMs to understand their intent.
2. **Inaccurate tool selection and parameterization**: Even if the user prompts are clear, LLMs may fail because they cannot correctly identify the required tools or set appropriate parameters.
3. **Inefficient tool scheduling**: For complex task workflows, LLMs may struggle to arrange the sequence of tool usage effectively.

To address these issues, the paper proposes the ControlLLM framework, which comprises three key components:

1. **Task Decomposer**: Breaks down complex tasks into a series of subtasks with well-defined inputs and outputs.
2. **Thoughts-on-Graph (ToG) Paradigm**: Searches for the optimal solution path on a pre-constructed tool graph. This graph specifies the parameter and dependency relations among different tools.
3. **Execution Engine**: Interprets the solution path and efficiently runs the required tools on different computational devices.

Through the collaborative work of these three components, ControlLLM improves the accuracy, efficiency, and flexibility of solving complex tasks involving multimodal data such as images, audio, and video. Additionally, the paper constructs a benchmark test set to evaluate the performance of ControlLLM against other existing methods. Experimental results show that ControlLLM has a significantly higher success rate in handling complex tasks than baseline methods.
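The idea of searching a solution path on a tool graph can be illustrated with a small sketch. This is not the authors' implementation: the tool names, the resource types, the `Tool` class, and the choice of a plain depth-first search with "fewest tools" as the selection criterion are all assumptions made for illustration. Each hypothetical tool declares the resource types it consumes and the type it produces, and the search enumerates tool sequences whose inputs are satisfied by resources already available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: tuple[str, ...]  # resource types the tool consumes
    output: str              # resource type the tool produces


# Hypothetical toolbox; none of these names come from the paper.
TOOLS = [
    Tool("image_captioning", ("image",), "text"),
    Tool("text_to_speech", ("text",), "audio"),
    Tool("object_detection", ("image",), "boxes"),
    Tool("image_to_video", ("image",), "video"),
    Tool("dub_video", ("video", "audio"), "video"),
]


def find_solutions(available, target, max_depth=4):
    """Enumerate tool sequences (by name) whose last tool produces `target`."""
    solutions = []

    def dfs(resources, path):
        # A path is a solution once its last tool yields the target type.
        if path and path[-1].output == target:
            solutions.append([tool.name for tool in path])
            return
        if len(path) >= max_depth:
            return
        for tool in TOOLS:
            if tool in path:
                continue  # use each tool at most once per path
            # A tool is applicable only if all of its inputs already exist.
            if all(kind in resources for kind in tool.inputs):
                dfs(resources | {tool.output}, path + [tool])

    dfs(frozenset(available), [])
    return solutions


def shortest(solutions):
    """Pick the solution with the fewest tools, or None if there is none."""
    return min(solutions, key=len) if solutions else None


if __name__ == "__main__":
    print(shortest(find_solutions({"image"}, "audio")))
```

In this toy setting, a request to turn an image into audio can only be satisfied by chaining a captioning tool into a speech tool, because the dependency relations in the graph rule out every shorter route. A real system would also have to bind concrete arguments to each tool and could rank candidate paths by something other than length.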

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Low-code LLM: Graphical User Interface over Large Language Models

LLM With Tools: A Survey

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

Tool Learning with Large Language Models: A Survey

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

VideoLLM: Modeling Video Sequence with Large Language Models

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

InfMLLM: A Unified Framework for Visual-Language Tasks

Towards Completeness-Oriented Tool Retrieval for Large Language Models

MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE