Abstract: Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can shrink model sizes but often impair accuracy, making them unsuitable for practical applications. In this work, we introduce ScheInfer, a high-performance inference engine designed to speed up LLM inference without compromising model accuracy. ScheInfer incorporates three innovative methods to increase inference efficiency: 1) model partitioning to allow asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication, 2) an adaptive partition algorithm to optimize the use of CPU, GPU, and PCIe communication capabilities, and 3) a token assignment strategy to handle diverse prompt and generation tasks during LLM inference. Comprehensive experiments were conducted with various LLMs such as Mixtral, LLaMA-2, Qwen, and PhiMoE across three test environments featuring different CPUs and GPUs. The experimental findings demonstrate that ScheInfer is $1.11\times$ to $1.80\times$ faster in decoding and $1.69\times$ to $6.33\times$ faster in pre-filling, leading to an overall speedup ranging from $1.25\times$ to $2.04\times$ compared to state-of-the-art solutions, llama.cpp and Fiddler.

What problem does this paper attempt to address?

The paper addresses efficient inference of large language models (LLMs) on moderate GPU systems. Specifically:

1. **High resource requirements**: Because of their size, LLMs place heavy demands on computing resources and memory, which makes efficient inference on moderate GPU systems difficult.
2. **Limitations of existing methods**: Techniques such as quantization or pruning reduce model size but usually degrade accuracy, making them unsuitable for practical applications.
3. **Multi-resource optimization**: Current methods usually rely on a single resource (such as the CPU or the GPU) and fail to use the system's CPU, GPU, and PCIe communication capabilities together, which leaves performance unused.

To solve these problems, the paper introduces ScheInfer, a high-performance inference engine that accelerates LLM inference without sacrificing model accuracy. ScheInfer improves inference efficiency through three methods:

1. **Model partitioning**: Allows tasks to be processed asynchronously across CPU computation, GPU computation, and CPU-GPU communication.
2. **Adaptive partitioning algorithm**: Optimizes the use of CPU, GPU, and PCIe communication capabilities.
3. **Token allocation strategy**: Handles the differing requirements of prompt and generation tasks during LLM inference.

The paper evaluates ScheInfer on multiple LLMs (such as Mixtral, LLaMA-2, Qwen, and PhiMoE) across three test environments with different CPUs and GPUs. The results show that ScheInfer is 1.11 to 1.80 times faster than existing solutions in the decoding phase and 1.69 to 6.33 times faster in the pre-filling phase, for an overall speedup of 1.25 to 2.04 times.
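To make the idea of balancing CPU computation, GPU computation, and PCIe transfer concrete, here is a toy sketch. It is not ScheInfer's algorithm, which the summary does not specify. It assumes an idealized model in which the three resources overlap perfectly, so the time for one pass equals the busiest resource's time. The names `UnitCosts`, `step_time`, and `best_partition`, and all cost numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    """Hypothetical per-unit costs in milliseconds (a unit = one layer or expert)."""
    cpu_ms: float   # CPU computes one unit in place
    gpu_ms: float   # GPU computes one unit whose weights are on the GPU
    pcie_ms: float  # copying one unit's weights from host memory to the GPU


def step_time(n_cpu: int, n_stream: int, n_resident: int, c: UnitCosts) -> float:
    """Time for one pass, assuming CPU, GPU, and PCIe work overlap perfectly."""
    cpu_busy = n_cpu * c.cpu_ms                     # units computed on the CPU
    gpu_busy = (n_stream + n_resident) * c.gpu_ms   # units computed on the GPU
    pcie_busy = n_stream * c.pcie_ms                # units copied over PCIe
    return max(cpu_busy, gpu_busy, pcie_busy)


def best_partition(n_units: int, gpu_slots: int, c: UnitCosts):
    """Exhaustive search; returns (time_ms, n_cpu, n_stream, n_resident)."""
    best = None
    for n_resident in range(min(n_units, gpu_slots) + 1):
        for n_cpu in range(n_units - n_resident + 1):
            n_stream = n_units - n_resident - n_cpu
            t = step_time(n_cpu, n_stream, n_resident, c)
            if best is None or t < best[0]:
                best = (t, n_cpu, n_stream, n_resident)
    return best


# Made-up numbers: 32 units, GPU memory holds 8 of them.
costs = UnitCosts(cpu_ms=4.0, gpu_ms=0.5, pcie_ms=2.0)
print(best_partition(n_units=32, gpu_slots=8, c=costs))
```

With these made-up costs, computing all 24 non-resident units on the CPU takes 96 ms and streaming all of them over PCIe takes 48 ms, while the balanced split (8 on the CPU, 16 streamed, 8 resident) takes 32 ms. Real systems do not overlap perfectly, and data dependencies between layers constrain the schedule, so this only illustrates why using all three resources together can beat relying on one.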

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Inference Performance Optimization for Large Language Models on CPUs

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Distributed Inference Performance Optimization for LLMs on CPUs

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Efficient LLM inference solution on Intel GPU

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Fast Distributed Inference Serving for Large Language Models

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

A Survey on Efficient Inference for Large Language Models

GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments

RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

High-throughput Generative Inference of Large Language Models with a Single GPU

Self-Selected Attention Span for Accelerating Large Language Model Inference