ControlLLM: Augment Language Models with Tools by Searching on Graphs

Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang
2023-12-18
Abstract: We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is at https://github.com/OpenGVLab/ControlLLM.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses the difficulties that large language models (LLMs) face when solving complex real-world tasks that require multi-modal tools. It focuses on three challenges:

1. **Ambiguous user prompts**: User instructions may be unclear or underspecified, making it hard for the LLM to infer the intent.
2. **Inaccurate tool selection and parameterization**: Even when the prompt is clear, the LLM may pick the wrong tool or set its parameters incorrectly.
3. **Inefficient tool scheduling**: For complex workflows, the LLM may struggle to arrange the sequence of tool calls effectively.

To address these issues, the paper proposes the ControlLLM framework, which comprises three key components:

1. **Task Decomposer**: Breaks a complex task down into a series of subtasks with well-defined inputs and outputs.
2. **Thoughts-on-Graph (ToG) Paradigm**: Searches for the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among the tools.
3. **Execution Engine**: Interprets the solution path and runs the required tools efficiently on different computational devices.

According to the paper, these three components together improve accuracy, efficiency, and versatility on complex tasks involving multi-modal data such as images, audio, and video. The paper also constructs a benchmark to compare ControlLLM with existing methods. Experimental results show that ControlLLM achieves a significantly higher success rate on complex tasks than the baseline methods.
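The idea of searching a tool graph for a solution path can be illustrated with a small sketch. This is not the paper's ToG algorithm: it is a simplified backward search over typed tools, written under the assumption that each tool consumes a fixed set of resource types and produces exactly one. The `Tool` class, the `solve` function, and every tool and type name below are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A node in the tool graph (hypothetical structure, not the paper's)."""
    name: str
    inputs: tuple[str, ...]  # resource types the tool consumes
    output: str              # resource type the tool produces


# Illustrative toolbox; the names and types are assumptions.
TOOLS = [
    Tool("image_captioning", ("image",), "text"),
    Tool("ocr", ("image",), "text"),
    Tool("text_to_speech", ("text",), "audio"),
    Tool("object_detection", ("image",), "bbox"),
    Tool("image_cropping", ("image", "bbox"), "cropped_image"),
]


def solve(target, available, tools, depth=0, max_depth=4):
    """Return every candidate plan (ordered list of tool names) producing `target`.

    The search walks backward from the target type: it picks a tool whose
    output matches, then recursively resolves each of that tool's inputs.
    `max_depth` bounds the recursion so cyclic dependencies terminate.
    """
    if target in available:
        return [[]]  # nothing to run, the resource already exists
    if depth >= max_depth:
        return []
    plans = []
    for tool in tools:
        if tool.output != target:
            continue
        partial = [[]]
        for needed in tool.inputs:
            sub_plans = solve(needed, available, tools, depth + 1, max_depth)
            # Combine plans for each input, skipping steps already scheduled.
            partial = [
                p + [step for step in q if step not in p]
                for p in partial
                for q in sub_plans
            ]
        plans.extend(p + [tool.name] for p in partial)
    return plans


if __name__ == "__main__":
    candidates = solve("audio", {"image"}, TOOLS)
    print(candidates)
    # One simple selection rule (an assumption): prefer the shortest plan.
    print(min(candidates, key=len))
```

In this toy setup, asking for `audio` from an `image` yields two candidate paths, one through each text-producing tool, and a separate selection step picks among them. A real system would also need to handle tool parameters and scoring of candidate paths, which this sketch leaves out.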