Abstract: The complexity of large language model (LLM) serving workloads has substantially increased due to the integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%.

What problem does this paper attempt to address?

This paper addresses the latency of large language model (LLM) serving when a request triggers an external tool invocation. Traditional serving systems treat the tool invocation as a separate, sequential step: the tool starts only after the LLM has finished decoding the full tool call, which lengthens request completion time. The authors identify a new opportunity: partially executing the tool while the LLM is still decoding, so that tool work overlaps with decoding and the total latency of the request drops.

To exploit this, the authors design a system named Conveyor, which optimizes the handling of requests involving external tools. Conveyor has two key design points:

1. **Tool Interface Design**: An interface that lets tool developers expose partial-execution opportunities to the LLM serving system. For example, a code interpreter can use the line-break character "\n" or the semicolon ";" as an indicator of a partial-execution opportunity.
2. **Fine-Grained Scheduler**: A scheduler that operates at token granularity, detects these partial-execution opportunities, and invokes the corresponding tool, minimizing unnecessary blocking and improving performance.

During LLM decoding, once Conveyor detects an indicator of partial execution, it invokes the tool and collects the results of the invocation for later prefilling. With these designs, Conveyor reduces request completion latency across several external tools, especially in code-generation, search, and planning tasks, with a latency improvement of up to 38.8%. For database-execution and calculator tools, however, Conveyor brings little noticeable improvement.

The authors also provide a mathematical analysis that explains under what circumstances partial tool execution improves performance, and they show that the empirical evaluation is consistent with the analysis.
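
The idea of detecting partial-execution opportunities at token granularity can be illustrated with a minimal Python sketch. It is not Conveyor's implementation or interface: the names `PartialCodeInterpreter`, `feed`, `finish`, and `serve` are hypothetical, the token list stands in for real LLM decoding, and the toy interpreter assumes each statement fits on a single line (it would break on indented blocks or on delimiters inside string literals). Statements are handed to a single background worker as soon as a "\n" or ";" arrives, so execution can overlap with the remaining decode steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable


class PartialCodeInterpreter:
    """Hypothetical tool: runs each complete statement as soon as it is decoded."""

    # Characters that signal a partial-execution opportunity (from the summary above).
    DELIMITERS = ("\n", ";")

    def __init__(self) -> None:
        self._buffer = ""                 # statement text decoded so far
        self._env: dict = {}              # shared namespace for executed statements
        self._pool = ThreadPoolExecutor(max_workers=1)  # one worker keeps order
        self._pending = []                # futures for submitted statements
        self.submitted = []               # statements handed to the worker, in order

    def feed(self, token: str) -> None:
        """Called once per decoded token; submits a statement at each delimiter."""
        for ch in token:
            if ch in self.DELIMITERS:
                self._submit()
            else:
                self._buffer += ch

    def _submit(self) -> None:
        stmt, self._buffer = self._buffer.strip(), ""
        if stmt:
            self.submitted.append(stmt)
            # Runs in the background while the caller continues decoding.
            self._pending.append(self._pool.submit(exec, stmt, self._env))

    def finish(self) -> dict:
        """Flush the trailing statement, wait for all work, return user variables."""
        self._submit()
        for future in self._pending:
            future.result()               # re-raises any error from the statement
        self._pool.shutdown()
        return {k: v for k, v in self._env.items() if not k.startswith("__")}


def serve(tokens: Iterable[str], tool: PartialCodeInterpreter) -> dict:
    """Toy scheduler loop: each iteration stands in for one decode step."""
    for token in tokens:
        tool.feed(token)
    return tool.finish()


# Assumed token stream for illustration only.
tokens = ["x", " =", " 2", "\n", "y = x", " * 21", ";", "z = y + 1"]
result = serve(tokens, PartialCodeInterpreter())
print(result)
```

In this sketch the first statement starts executing after the fourth token, rather than after the whole program has been decoded. How much latency that saves depends on how long the tool runs relative to the remaining decoding, which is the kind of condition the authors' analysis is said to characterize.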

Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

ControlLLM: Augment Language Models with Tools by Searching on Graphs

An LLM-Tool Compiler for Fused Parallel Function Calling

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction

Efficient and Economic Large Language Model Inference with Attention Offloading

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

Fairness in Serving Large Language Models

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Large Language Models as Tool Makers

Efficient LLM Scheduling by Learning to Rank

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving

LLM With Tools: A Survey

GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution