Teola: Towards End-to-End Optimization of LLM-based Applications

Xin Tan,Yimin Jiang,Yitao Yang,Hong Xu

2024-06-29

Abstract: Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

The paper addresses end-to-end performance optimization of applications built on large language models (LLMs). Although much work has gone into optimizing LLM inference, optimization of the overall workflow has been largely overlooked. Existing frameworks orchestrate applications at a coarse granularity, treating each task module as a unit, which limits joint optimization across modules and leads to suboptimal scheduling decisions. The paper proposes fine-grained end-to-end orchestration, which uses task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This exposes a larger design space, makes it possible to parallelize and pipeline primitives across different modules, and improves application-level performance through better scheduling. The authors build an orchestration framework called Teola to implement this approach. Experiments show that Teola achieves up to 2.09x speedup over existing systems across various popular LLM applications.

Specifically, the main contributions of Teola include:

1. Identifying the limitations of current LLM orchestration frameworks: coarse-grained, module-level orchestration limits optimization potential, and request-level scheduling is mismatched with end-to-end application performance.
2. Proposing a fine-grained orchestration method that represents query workflows as primitive-level dataflow graphs, thereby expanding the design space for end-to-end optimization to include graph optimizations (such as parallelization and pipelining) and application-aware scheduling.
3. Designing and implementing Teola, demonstrating the feasibility and advantages of this method. Experimental results show that Teola outperforms existing systems in popular applications.
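To make the idea of a primitive-level dataflow graph more concrete, here is a minimal Python sketch. It is not Teola's implementation or API: the primitive names, the dependency structure, and the latency figures are all made up for illustration. It models a hypothetical retrieval-augmented query as a graph of primitives and compares running them strictly one after another with running every primitive as soon as its dependencies have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical primitives for one query: name -> (latency_ms, dependencies).
# Names and latencies are illustrative assumptions, not figures from the paper.
PRIMITIVES = {
    "embed_query": (10, []),
    "vector_search": (20, ["embed_query"]),
    "rerank": (15, ["vector_search"]),
    # Prefilling the fixed system prompt does not depend on retrieval,
    # so it may overlap with the retrieval primitives above.
    "prefill_system_prompt": (30, []),
    "prefill_context": (40, ["rerank", "prefill_system_prompt"]),
    "decode": (100, ["prefill_context"]),
}


def sequential_latency(prims):
    """Latency when primitives run strictly one after another."""
    return sum(latency for latency, _ in prims.values())


def critical_path_latency(prims):
    """Latency when each primitive starts as soon as its dependencies finish."""
    graph = {name: deps for name, (_, deps) in prims.items()}
    finish = {}
    for name in TopologicalSorter(graph).static_order():
        latency, deps = prims[name]
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return max(finish.values())


if __name__ == "__main__":
    seq = sequential_latency(PRIMITIVES)
    par = critical_path_latency(PRIMITIVES)
    print(f"sequential: {seq} ms, dependency-aware: {par} ms")
```

In this toy graph the gain comes only from overlapping the system-prompt prefill with retrieval. The sketch assumes unlimited parallel resources and ignores batching, pipelining within a primitive, and scheduling across concurrent queries, all of which the paper treats as part of the design space.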

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Optimizing High-Level Synthesis Designs with Retrieval-Augmented Large Language Models

Meta-programming for cross-domain tensor optimizations

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

City-LEO: Toward Transparent City Management Using LLM with End-to-End Optimization

Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning.

Autonomous Multi-Objective Optimization Using Large Language Model

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

A Framework to Implement 1+N Multi-task Fine-tuning Pattern in LLMs Using the CGC-LORA Algorithm

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking

ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline