Abstract: The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses the high computational and memory demands of large language model (LLM) inference, with the goal of running these models efficiently in resource-constrained environments (e.g., a single commodity GPU). Specifically, it focuses on batch processing for latency-insensitive tasks, where higher throughput can be obtained by trading away latency, which in turn lowers the hardware requirements.

### Background and Challenges

1. **High Resource Demand**: Large language models (such as GPT-175B) require a substantial amount of GPU memory just to load the model weights. For example, GPT-175B needs 325GB of GPU memory, which typically requires multiple high-end accelerators.
2. **Latency-Insensitive Tasks**: Besides interactive applications (like chatbots), LLMs are widely used in backend tasks such as benchmarking, information extraction, data curation, and form processing. These tasks usually process large amounts of data in batches and are not sensitive to latency.
3. **Limitations of Existing Methods**:
   - **Model Compression**: Reduces the memory footprint of the model, but usually assumes the model fits entirely in GPU memory, so it is not enough on its own to run large-scale models on a single commodity GPU.
   - **Collaborative Inference**: Decentralizes the inference cost but still suffers from inefficient resource utilization.
   - **Offloading Techniques**: Use CPU memory and disk storage, but existing systems have low throughput on a single GPU because of inefficient I/O scheduling and tensor placement.

### Solution

The paper proposes FlexGen, a high-throughput generative inference engine capable of running large language models on a single commodity GPU. The main contributions of FlexGen include:

1. **Efficient Offloading Strategies**:
   - **Multi-Level Memory Aggregation**: FlexGen can be flexibly configured to aggregate memory and computational resources from the GPU, CPU, and disk to run LLMs.
   - **Linear Programming Algorithm**: Searches for efficient tensor storage and access patterns by solving a linear programming problem, which improves throughput.
2. **Effective Compression Strategies**:
   - **4-bit Quantization**: Compresses the weights and the attention cache to 4 bits with negligible accuracy loss, reducing I/O costs and memory footprint.
3. **Performance Improvements**:
   - When running OPT-175B on a single 16GB GPU, FlexGen achieves a generation throughput of 1 token/s with an effective batch size of 144.
   - On the HELM benchmark, FlexGen can benchmark a 30B model on 7 representative sub-scenarios in 21 hours using a 16GB GPU.

### Experimental Results

- **Comparison with Existing Systems** (at the same latency of 5000 seconds):
  - **DeepSpeed Zero-Inference**: FlexGen's effective batch size is 64 (2048 tokens in total), while DeepSpeed Zero-Inference's batch size is only 1 (32 tokens in total).
  - **Hugging Face Accelerate**: Unable to complete a single batch.
- **Allowing Higher Latency (12000 seconds)**: FlexGen's maximum throughput is 69 times higher than that of the baseline systems, with an effective batch size of 256 (8192 tokens in total), whereas DeepSpeed Zero-Inference and Hugging Face Accelerate cannot use larger batch sizes due to memory limitations.
- **4-bit Compression**: At a latency of 4000 seconds, FlexGen reaches 100 times the baselines' maximum throughput, with an effective batch size of 144 (4608 tokens in total). All weights are stored in CPU memory, which eliminates the need for disk offloading.

### Conclusion

Through efficient offloading and compression strategies, FlexGen significantly improves the throughput of running large language models on a single commodity GPU, providing a new option for high-throughput inference in resource-constrained environments.
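
The throughput figures quoted in the experimental results can be sanity-checked from the reported batch sizes, token totals, and latencies. The short sketch below assumes that each sequence generates 32 tokens (inferred from the totals, e.g. 2048 / 64 = 32) and that generation throughput is simply total generated tokens divided by latency. The function name `generation_throughput` is hypothetical and is not part of FlexGen.

```python
def generation_throughput(effective_batch_size: int,
                          tokens_per_sequence: int,
                          latency_s: float) -> float:
    """Generated tokens per second for one batch (assumed definition)."""
    total_tokens = effective_batch_size * tokens_per_sequence
    return total_tokens / latency_s


# (effective batch size, latency in seconds) as reported in the summary.
# 32 generated tokens per sequence is an inference from the token totals.
REPORTED = {
    "FlexGen, 5000 s": (64, 5000),
    "FlexGen, 12000 s": (256, 12000),
    "FlexGen + 4-bit compression, 4000 s": (144, 4000),
}

if __name__ == "__main__":
    for name, (batch, latency) in REPORTED.items():
        tput = generation_throughput(batch, 32, latency)
        print(f"{name}: {batch * 32} tokens, {tput:.2f} token/s")
```

Under these assumptions the compressed configuration works out to 4608 / 4000 ≈ 1.15 token/s, which is consistent with the "1 token/s" headline figure, while the two uncompressed configurations come to roughly 0.41 and 0.68 token/s.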
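
To make the idea of 4-bit compression more concrete, here is a minimal sketch of group-wise asymmetric min-max quantization in plain Python. The summary does not describe FlexGen's exact scheme, so this is a generic illustration under stated assumptions: values are quantized one group at a time, each group stores its own minimum and scale, and two 4-bit codes are packed per byte. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1              # 15 distinct steps for 4 bits
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate floats from codes plus per-group metadata."""
    return [lo + c * scale for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Pack two 4-bit codes into each byte (pads with a zero code if odd)."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 2.0, 0.1]
    codes, lo, scale = quantize_group(group)
    restored = dequantize_group(codes, lo, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print("codes:", codes)
    print("packed bytes:", len(pack_nibbles(codes)))
    print("max abs error:", worst)
```

The per-element reconstruction error is bounded by half the group's scale, so smaller groups generally give lower error at the cost of storing more per-group metadata. Compared with 16-bit storage, the packed codes take roughly a quarter of the space, which is what reduces both I/O volume and memory footprint.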

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

High-throughput Generative Inference of Large Language Models with a Single GPU

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Efficient LLM inference solution on Intel GPU

RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

FlashDecoding++: Faster Large Language Model Inference on GPUs

Efficient Large-Scale Language Model Training on GPU Clusters

Flextron: Many-in-One Flexible Large Language Model

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Inference Performance Optimization for Large Language Models on CPUs

HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads