DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
2024-01-09
Abstract: The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.
Performance, Machine Learning
What problem does this paper attempt to address?
This paper addresses the need for high-throughput, low-latency serving systems when deploying and scaling large language models (LLMs). Existing frameworks struggle to provide consistent quality of service for workloads with long prompts, a problem that grows as context windows lengthen. For example, models and systems such as MPT-StoryWriter and DeepSpeed Ulysses support context windows of tens of thousands of tokens, but existing systems incur significantly higher latency when processing such long prompts, which degrades the user experience and makes service-level agreements (SLAs) harder to meet. To address this, the paper introduces DeepSpeed-FastGen, which is built around a new prompt and generation composition strategy, Dynamic SplitFuse. Dynamic SplitFuse breaks long prompts into smaller chunks and schedules them across multiple forward passes, with generation performed only in the final pass. Short prompts are combined to exactly fill the target token budget. This strategy improves responsiveness and efficiency, reduces latency variation, and makes service more consistent. Compared with state-of-the-art systems such as vLLM, DeepSpeed-FastGen achieves up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower token-level tail latency. The paper reports these improvements across various models and hardware configurations.
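The splitting and packing behavior described above can be illustrated with a small scheduling sketch. This is not DeepSpeed-FastGen's implementation: the `Request` class, the `schedule_passes` function, the first-come-first-served ordering, and the budget of 512 tokens are all assumptions made for illustration, and the sketch ignores the tokens that already-running generations would also consume from each pass's budget.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record (not a DeepSpeed-FastGen type)."""
    rid: str
    prompt_len: int      # total prompt tokens
    processed: int = 0   # prompt tokens already scheduled

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.processed


def schedule_passes(requests, token_budget):
    """Pack prompt tokens into forward passes of at most `token_budget` tokens.

    Returns a list of passes. Each pass is a list of
    (request id, tokens scheduled, generates) tuples, where `generates`
    is True only in the pass that consumes the last chunk of that prompt.
    Assumption: requests are served in arrival order.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    pending = deque(requests)
    passes = []
    while pending:
        budget = token_budget
        batch = []
        while pending and budget > 0:
            req = pending[0]
            take = min(req.remaining, budget)  # split a long prompt if needed
            req.processed += take
            budget -= take
            finished = req.remaining == 0
            batch.append((req.rid, take, finished))
            if finished:
                pending.popleft()  # next prompt fills the leftover budget
        passes.append(batch)
    return passes


if __name__ == "__main__":
    reqs = [Request("A", 1200), Request("B", 300), Request("C", 548)]
    for i, batch in enumerate(schedule_passes(reqs, token_budget=512), start=1):
        print(f"pass {i}: {batch}")
```

With these made-up prompt lengths, the long prompt A is spread over three passes, the short prompt B and the head of C are fused into the leftover space of A's final pass, and every pass carries exactly 512 tokens. That uniform pass size is the property the summary credits for reduced latency variation.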