FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
DOI: https://doi.org/10.48550/arXiv.2303.06865
2023-06-12
Abstract: The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
Machine Learning, Artificial Intelligence, Performance
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the high computational and memory demands of large language model (LLM) inference, with the goal of running these models efficiently in resource-constrained environments such as a single commodity GPU. It focuses on batch processing for latency-insensitive tasks, where trading latency for throughput reduces the resources required.

### Background and Challenges

1. **High Resource Demand**: Large language models (such as GPT-175B) require a substantial amount of GPU memory just to load the model weights. For example, GPT-175B needs 325GB of GPU memory, which typically requires multiple high-end accelerators.
2. **Latency-Insensitive Tasks**: Besides interactive applications (like chatbots), LLMs are widely used in backend tasks such as benchmarking, information extraction, data curation, and form processing. These tasks usually process large amounts of data in batches and are not sensitive to latency.
3. **Limitations of Existing Methods**:
   - **Model Compression**: Reduces the memory footprint of the model, but usually assumes the compressed model fits entirely in GPU memory, so it does not by itself make very large models runnable on a single commodity GPU.
   - **Collaborative Inference**: Spreads the inference cost across participants but still suffers from inefficient resource utilization.
   - **Offloading Techniques**: Use CPU memory and disk, but existing systems have low throughput on a single GPU because of inefficient I/O scheduling and tensor placement.

### Solution

The paper proposes FlexGen, a high-throughput generation engine capable of running large language models on a single commodity GPU. The main contributions of FlexGen include:

1. **Efficient Offloading Strategies**:
   - **Multi-Level Memory Aggregation**: FlexGen can be flexibly configured to aggregate memory and computation from the GPU, CPU, and disk to run LLMs.
   - **Linear Programming Algorithm**: Searches for efficient tensor storage and access patterns by solving a linear programming problem, which improves throughput.
2. **Effective Compression Strategies**:
   - **4-bit Quantization**: Compresses the weights and the attention cache to 4 bits with negligible accuracy loss, reducing I/O costs and memory footprint.
3. **Performance Improvements**:
   - When running OPT-175B on a single 16GB GPU, FlexGen reaches a generation throughput of 1 token/s with an effective batch size of 144.
   - On the HELM benchmark, FlexGen can benchmark a 30B model on 7 representative sub-scenarios in 21 hours using a 16GB GPU.

### Experimental Results

- **Comparison with Existing Systems**:
  - **DeepSpeed Zero-Inference**: At the same latency (5000 seconds), FlexGen's effective batch size is 64 (2048 tokens in total), while DeepSpeed Zero-Inference's effective batch size is only 1 (32 tokens in total).
  - **Hugging Face Accelerate**: Unable to complete a single batch.
- **Allowing Higher Latency (12000 seconds)**: FlexGen's maximum throughput is 69 times higher than that of the baseline systems, with an effective batch size of 256 (8192 tokens in total), whereas DeepSpeed Zero-Inference and Hugging Face Accelerate cannot use larger batch sizes due to memory limitations.
- **4-bit Compression**: At a latency of 4000 seconds, FlexGen reaches 100 times the maximum throughput of the baseline systems, with an effective batch size of 144 (4608 tokens in total) and all weights held in CPU memory, which eliminates the need for disk offloading.

### Conclusion

FlexGen significantly improves the throughput of running large language models on a single commodity GPU through efficient offloading and compression strategies, providing a new option for high-throughput inference in resource-constrained environments.
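
The 325GB weight footprint quoted for GPT-175B in the background can be checked with simple arithmetic. The sketch below assumes 2 bytes per parameter (FP16) and reads "GB" as GiB (2^30 bytes); both are assumptions, since the summary does not state them. The helper name `weight_gib` is illustrative only.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Return the size of the model weights in GiB (2**30 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


PARAMS_175B = 175e9  # parameter count implied by the model name

fp16_gib = weight_gib(PARAMS_175B, 16)  # assumption: weights stored in FP16
int4_gib = weight_gib(PARAMS_175B, 4)   # 4-bit weights, ignoring quantization metadata

print(f"FP16 weights:  {fp16_gib:.0f} GiB")
print(f"4-bit weights: {int4_gib:.0f} GiB")
```

Under these assumptions the FP16 weights come to about 326 GiB, in line with the quoted 325GB, and the 4-bit weights to about 81 GiB before any per-group metadata is counted. That reduction is what makes it plausible to keep all weights in CPU memory on a host with enough RAM.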
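
The batch sizes, token counts, and latencies in the experimental results are linked by simple arithmetic. The sketch below assumes that generation throughput is the total number of generated tokens divided by the end-to-end latency; the tuple layout and names are illustrative, and only the FlexGen figures given above are used.

```python
# (label, effective batch size, total generated tokens, latency in seconds),
# copied from the experimental results above.
RESULTS = [
    ("FlexGen, 5000 s", 64, 2048, 5000),
    ("FlexGen, 12000 s", 256, 8192, 12000),
    ("FlexGen + 4-bit, 4000 s", 144, 4608, 4000),
]


def throughput(total_tokens: int, latency_s: float) -> float:
    """Generated tokens per second (assumed definition of throughput)."""
    return total_tokens / latency_s


for label, batch, tokens, latency in RESULTS:
    tokens_per_sequence = tokens // batch  # generated tokens per sequence in the batch
    print(
        f"{label}: {tokens_per_sequence} tokens/sequence, "
        f"{throughput(tokens, latency):.4f} token/s"
    )
```

Every configuration works out to 32 generated tokens per sequence, and the 4-bit configuration gives 4608 / 4000 = 1.152 token/s, which matches the "1 token/s" headline figure. The 69x and 100x ratios cannot be reproduced from this table alone, because the baselines' best measured throughput is not given in the summary.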
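
To make the 4-bit compression step more concrete, here is a minimal sketch of group-wise min-max quantization. The summary only says that weights and the attention cache are compressed to 4 bits, so the specific scheme, the group size of 64, and all function names here are assumptions for illustration rather than FlexGen's actual implementation. Codes are kept one per Python integer; a real implementation would pack two 4-bit codes into each byte.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, group minimum, scale)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Min-max quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def quantize(values: List[float], group_size: int = 64, bits: int = 4) -> List[Group]:
    """Split `values` into consecutive groups and quantize each independently."""
    return [
        quantize_group(values[start:start + group_size], bits)
        for start in range(0, len(values), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Reconstruct approximate floats from the per-group codes."""
    restored: List[float] = []
    for codes, lo, scale in groups:
        restored.extend(lo + code * scale for code in codes)
    return restored


# Example: quantize 128 synthetic "weights" and measure the worst-case error.
weights = [math.sin(0.1 * i) for i in range(128)]
groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

Each group stores its own minimum and scale, so the reconstruction error of any value is bounded by half a quantization step of its group. Smaller groups lower the error at the cost of more metadata.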
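
The idea of choosing tensor placement by linear programming can be illustrated with a toy problem. The sketch below is not the paper's cost model: it only decides what fraction of one tensor lives on each memory tier, minimizing a hypothetical per-GiB access cost subject to tier capacities. For this particular structure (fractions sum to 1, one upper bound per tier), the LP optimum is obtained by filling tiers in ascending order of cost, so no solver is needed. The tier capacities (other than the 16GB GPU) and costs are made-up values.

```python
from typing import Dict, List, Tuple

Tier = Tuple[str, float, float]  # (name, capacity in GiB, access cost per GiB)


def place_fractions(tensor_gib: float, tiers: List[Tier]) -> Dict[str, float]:
    """Solve the toy LP

        minimize   sum(cost_i * x_i)
        subject to sum(x_i) = 1,  0 <= x_i <= capacity_i / tensor_gib

    Filling the cheapest tiers first is optimal for this structure.
    """
    remaining = 1.0
    placement = {name: 0.0 for name, _, _ in tiers}
    for name, capacity_gib, _cost in sorted(tiers, key=lambda tier: tier[2]):
        share = min(remaining, capacity_gib / tensor_gib)
        placement[name] = share
        remaining -= share
    if remaining > 1e-12:
        raise ValueError("tensor does not fit in the combined tier capacity")
    return placement


# Hypothetical tiers: capacities and relative costs are illustrative only.
TIERS: List[Tier] = [
    ("gpu", 16.0, 0.0),
    ("cpu", 200.0, 1.0),
    ("disk", 1500.0, 10.0),
]

for label, size_gib in [("FP16 weights", 326.0), ("4-bit weights", 81.5)]:
    fractions = place_fractions(size_gib, TIERS)
    summary = ", ".join(f"{name}={share:.1%}" for name, share in fractions.items())
    print(f"{label}: {summary}")
```

With these made-up numbers the uncompressed weights spill onto disk, while the 4-bit weights fit in GPU plus CPU memory and leave the disk fraction at zero. A realistic formulation would also need to place activations and the attention cache, and to model compute and I/O overlap, all of which this sketch omits.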