DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

2024-01-09

Abstract: The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.

Performance, Machine Learning

What problem does this paper attempt to address?

This paper addresses the need for high-throughput, low-latency serving systems when deploying and scaling large language models (LLMs). Existing frameworks struggle to provide consistent quality of service for workloads with long prompts, a problem that grows as context windows get longer. For example, models and systems such as MPT-StoryWriter and DeepSpeed Ulysses support context windows of tens of thousands of tokens, but existing serving systems show significantly higher latency when processing such long prompts, which hurts the user experience and makes service-level agreements (SLAs) harder to meet. To address this, the paper introduces DeepSpeed-FastGen, which relies on a new prompt and generation composition strategy called Dynamic SplitFuse. Dynamic SplitFuse breaks long prompts into smaller chunks and schedules them across multiple forward passes, with generation performed only in the final pass. Short prompts are composed together so that they exactly fill a target token budget. Because every forward pass then processes a similar number of tokens, the strategy improves responsiveness and efficiency, reduces latency variance, and makes service more consistent. With these improvements, DeepSpeed-FastGen achieves up to 2.3x higher effective throughput, 2x lower average latency, and up to 3.7x lower token-level tail latency than state-of-the-art systems such as vLLM. The paper reports these gains across several models and hardware configurations.
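The scheduling idea described above can be illustrated with a small sketch. This is not DeepSpeed-FastGen's actual scheduler: the `Request` record, the `schedule` function, and the first-come-first-served policy are hypothetical choices made for illustration. It only shows how long prompts can be split into chunks and short prompts packed together so that each forward pass stays within a fixed token budget, with a request's first generated token produced in the pass that handles its final prompt chunk.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    rid: str
    prompt_len: int          # total prompt tokens
    max_new_tokens: int      # tokens to generate (assumed >= 1)
    prompt_done: int = 0     # prompt tokens processed so far
    generated: int = 0       # tokens generated so far


def schedule(requests, token_budget):
    """Return one batch per forward pass as a list of (rid, n_tokens, kind)."""
    pending = deque(requests)  # prompts not yet fully processed (FIFO, an assumption)
    decoding = []              # requests currently generating
    passes = []
    while pending or decoding:
        batch = []
        budget = token_budget
        # Ongoing generations each consume one token of the budget.
        active, waiting = decoding[:budget], decoding[budget:]
        for r in active:
            r.generated += 1
            batch.append((r.rid, 1, "decode"))
        budget -= len(active)
        decoding = [r for r in active if r.generated < r.max_new_tokens] + waiting
        # Fill the rest of the budget with prompt chunks: long prompts are
        # split across passes, short prompts are packed into the same pass.
        while budget > 0 and pending:
            r = pending[0]
            chunk = min(budget, r.prompt_len - r.prompt_done)
            r.prompt_done += chunk
            budget -= chunk
            finished = r.prompt_done == r.prompt_len
            batch.append((r.rid, chunk, "prompt+first_token" if finished else "prompt"))
            if finished:
                pending.popleft()
                r.generated = 1  # the final prompt chunk's pass yields the first token
                if r.generated < r.max_new_tokens:
                    decoding.append(r)
        passes.append(batch)
    return passes


if __name__ == "__main__":
    demo = [Request("long", 20, 3), Request("short", 4, 2)]
    for i, batch in enumerate(schedule(demo, token_budget=8), start=1):
        print(i, batch)
```

In this toy run, the 20-token prompt is spread over three passes, and the 4-token prompt shares the third pass with the long prompt's final chunk, so the first three passes each process exactly 8 tokens. A production scheduler would also have to manage KV-cache memory and request arrival over time, which this sketch leaves out.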

Fast distributed inference serving for large language models

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

LLMCad: Fast and Scalable On-device Large Language Model Inference

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

FlashDecoding++: Faster Large Language Model Inference on GPUs

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models

SPEED: Speculative Pipelined Execution for Efficient Decoding

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

New Solutions on LLM Acceleration, Optimization, and Application

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines