Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo
2024-06-05
Abstract: The complexity of large language model (LLM) serving workloads has substantially increased due to the integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%.
Computation and Language; Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
This paper addresses request latency in large language model (LLM) serving when a request triggers an external tool invocation. Conventional approaches treat the tool invocation as a separate, sequential step, which lengthens request completion time. The authors observe that a tool can often begin executing partially while the LLM is still decoding, and that overlapping the two reduces the total latency of request completion.

To exploit this, the authors design Conveyor, a serving system optimized for requests that involve external tools. Conveyor has two key design points:

1. **Tool Interface Design**: An interface that lets tool developers expose partial-execution opportunities to the LLM serving system. For example, a code interpreter can treat the line-break character "\n" or the semicolon ";" as an indicator of a partial-execution opportunity.
2. **Fine-Grained Scheduler**: A scheduler that operates at token granularity, detects these partial-execution opportunities, and invokes the corresponding tools to minimize unnecessary blocking and improve performance.

During LLM decoding, once Conveyor detects a partial-execution indicator, it invokes the tool and collects the results of the invocation for the subsequent prefill. With these designs, Conveyor reduces request completion latency by up to 38.8% across several external tools, notably in code generation, search, and planning tasks. For database execution and calculator tools, however, Conveyor yields little noticeable improvement.
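The interplay between the tool interface and the token-granularity scheduler can be illustrated with a minimal Python sketch. Everything named here (`LineInterpreterTool`, `feed`, `finish`, `serve`) is a hypothetical stand-in rather than Conveyor's actual API, and the loop runs the tool synchronously, whereas a real serving system would presumably dispatch tool work without blocking decoding. The toy executor also splits naively on "\n" and ";", so it does not handle delimiters inside string literals or indented multi-line blocks.

```python
from typing import Any, Dict, Iterable, List


class LineInterpreterTool:
    """Hypothetical code-interpreter tool that exposes partial-execution points.

    Names and structure are illustrative assumptions, not Conveyor's real API.
    """

    DELIMITERS = ("\n", ";")  # the indicators mentioned in the summary

    def __init__(self) -> None:
        self.buffer = ""                      # decoded text not yet executed
        self.namespace: Dict[str, Any] = {}   # interpreter state shared across segments
        self.executed: List[str] = []         # segments run so far, in order

    def _run(self, segment: str) -> None:
        segment = segment.strip()
        if segment:
            exec(segment, self.namespace)  # toy executor; never exec untrusted code
            self.executed.append(segment)

    def feed(self, token: str) -> None:
        """Accept one decoded token and run every segment it completes."""
        self.buffer += token
        while True:
            cuts = [i for i in (self.buffer.find(d) for d in self.DELIMITERS) if i != -1]
            if not cuts:
                return
            cut = min(cuts)  # earliest delimiter closes the next segment
            segment, self.buffer = self.buffer[:cut], self.buffer[cut + 1:]
            self._run(segment)

    def finish(self) -> Dict[str, Any]:
        """Run any trailing text once decoding of the tool call ends."""
        self._run(self.buffer)
        self.buffer = ""
        return self.namespace


def serve(token_stream: Iterable[str], tool: LineInterpreterTool) -> Dict[str, Any]:
    """Token-granularity loop: hand each token to the tool as soon as it is decoded."""
    for token in token_stream:
        tool.feed(token)  # a real scheduler would dispatch this without blocking decoding
    return tool.finish()


# Simulated decoder output for a three-statement tool call.
tokens = ["x", " =", " 2", "\n", "y", " = x", " * 21", ";", "z = y + 1"]
tool = LineInterpreterTool()
state = serve(tokens, tool)
print(tool.executed)
print(state["z"])
```

In this sketch the first two statements have already run by the time the last token arrives, so only the final segment remains to execute after decoding ends. That remaining tail is the part of tool latency that partial execution cannot hide.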
The authors also provide a mathematical analysis that explains under what circumstances partial tool execution improves performance, and they show that the empirical evaluation is consistent with this analysis.
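The summary does not reproduce the paper's analysis, so the following is only a simplified two-stage pipeline model offered for intuition, not the authors' formulation. It assumes the tool call is decoded as $n$ segments, where segment $i$ takes $d_i$ to decode and $e_i$ to execute, and that segments execute one at a time in order; all symbols are assumptions introduced here.

```latex
\begin{align*}
T_{\text{seq}} &= \sum_{i=1}^{n} d_i + \sum_{i=1}^{n} e_i \\
C_0 &= 0, \qquad
C_i = \max\Bigl(C_{i-1},\ \sum_{j=1}^{i} d_j\Bigr) + e_i, \qquad
T_{\text{partial}} = C_n \\
T_{\text{seq}} - T_{\text{partial}} &\le
\min\Bigl(\sum_{i=1}^{n-1} e_i,\ \sum_{i=2}^{n} d_i\Bigr)
\end{align*}
```

Here $C_i$ is the time at which segment $i$ finishes executing: it cannot start before the previous segment finishes or before its own text has been decoded. Under this toy model, the saving is bounded by the smaller of the execution time and the decoding time that can be overlapped, so a tool call with a single segment ($n = 1$) or with negligible execution time gains almost nothing.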