Teola: Towards End-to-End Optimization of LLM-based Applications

Xin Tan, Yimin Jiang, Yitao Yang, Hong Xu
2024-06-29
Abstract: Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses end-to-end performance optimization for applications built on large language models (LLMs). Although much work has gone into optimizing LLM inference, optimization of the overall workflow has been largely overlooked. Existing frameworks orchestrate applications at a coarse granularity, treating each task module as a unit. This confines optimization to within each module, prevents joint optimization across modules, and leads to suboptimal scheduling decisions.

The paper proposes fine-grained end-to-end orchestration, which uses task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This exposes a larger design space, makes it possible to parallelize and pipeline primitives that belong to different modules, and improves application-level performance through better scheduling. The authors build a new orchestration framework, Teola, that implements this approach. Experiments show that Teola achieves up to a 2.09x speedup over existing systems across various popular LLM applications.

The main contributions of the paper are:

1. Identifying the limitations of current LLM orchestration frameworks: coarse-grained, module-level orchestration limits the optimization potential, and request-level scheduling is mismatched with end-to-end application performance.
2. Proposing a fine-grained orchestration method that represents query workflows as primitive-level dataflow graphs, which expands the design space for end-to-end optimization to include graph optimizations (such as parallelization and pipelining) and application-aware scheduling.
3. Designing and implementing Teola, which demonstrates the feasibility and advantages of this method. Experimental results show that Teola outperforms existing systems on popular applications.
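
To make the idea of a primitive-level dataflow graph more concrete, the following is a minimal sketch in Python. It is not Teola's API or implementation. The primitive names and the dependency structure are hypothetical, chosen only to resemble a retrieval-augmented query. The sketch groups primitives into stages by dependency depth, so that primitives in the same stage could in principle run in parallel even when they belong to different modules.

```python
from collections import defaultdict

# Hypothetical primitives for one query (illustrative names, not from the paper).
# Each primitive maps to the list of primitives whose output it consumes.
DEPENDENCIES = {
    "embed_query": [],
    "prefill_instruction": [],  # assumed: prefill of a prompt part that needs no retrieved context
    "vector_search": ["embed_query"],
    "rerank": ["vector_search"],
    "prefill_context": ["prefill_instruction", "rerank"],
    "decode": ["prefill_context"],
}


def parallel_stages(deps):
    """Group primitives by dependency depth.

    Primitives in the same stage do not depend on each other,
    so a scheduler would be free to run them concurrently.
    """
    level = {}

    def depth(node):
        # Depth is 0 for primitives without inputs, else 1 + deepest input.
        if node not in level:
            level[node] = 1 + max((depth(p) for p in deps[node]), default=-1)
        return level[node]

    stages = defaultdict(list)
    for node in deps:
        stages[depth(node)].append(node)
    return [sorted(stages[i]) for i in range(len(stages))]


STAGES = parallel_stages(DEPENDENCIES)
for index, stage in enumerate(STAGES):
    print(index, stage)
```

In this toy graph, a module-level view would run the retrieval module to completion before starting the LLM module. At primitive granularity, the dependency structure shows that the hypothetical `prefill_instruction` step has no input from retrieval and can share the first stage with `embed_query`.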