Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu
2024-11-24
Abstract: Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can shrink model sizes but often impair accuracy, making them unsuitable for practical applications. In this work, we introduce ScheInfer, a high-performance inference engine designed to speed up LLM inference without compromising model accuracy. ScheInfer incorporates three innovative methods to increase inference efficiency: 1) model partitioning to allow asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication, 2) an adaptive partition algorithm to optimize the use of CPU, GPU, and PCIe communication capabilities, and 3) a token assignment strategy to handle diverse prompt and generation tasks during LLM inference. Comprehensive experiments were conducted with various LLMs such as Mixtral, LLaMA-2, Qwen, and PhiMoE across three test environments featuring different CPUs and GPUs. The experimental findings demonstrate that ScheInfer achieves speedups of $1.11\times$ to $1.80\times$ in decoding and $1.69\times$ to $6.33\times$ in pre-filling, leading to an overall speedup ranging from $1.25\times$ to $2.04\times$ compared to the state-of-the-art solutions llama.cpp and Fiddler.
Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
The main problem this paper addresses is efficient inference of large language models (LLMs) on moderate GPU systems. Specifically:

1. **High resource requirements**: Because of their large model sizes, LLMs place heavy demands on computing resources and memory, which makes efficient inference difficult on moderate GPU systems.
2. **Limitations of existing methods**: Existing techniques such as quantization or pruning can reduce the model size, but they usually degrade the model's accuracy, making them unsuitable for practical applications.
3. **Multi-resource optimization**: Current methods usually rely on a single resource (such as the CPU or the GPU) and fail to fully exploit the multiple resources in the system (CPU, GPU, and PCIe communication capabilities), resulting in inefficiency.

To solve these problems, the paper introduces ScheInfer, a high-performance inference engine that aims to accelerate LLM inference without sacrificing model accuracy. ScheInfer improves inference efficiency through three methods:

1. **Model partitioning**: Allows asynchronous processing of tasks across CPU computation, GPU computation, and CPU-GPU communication.
2. **Adaptive partitioning algorithm**: Optimizes the use of CPU, GPU, and PCIe communication capabilities.
3. **Token allocation strategy**: Handles the differing requirements of prompt and generation tasks during LLM inference.

The paper verifies the effectiveness of ScheInfer through experiments on multiple LLMs (such as Mixtral, LLaMA-2, Qwen, and PhiMoE) in three test environments with different CPUs and GPUs. The experimental results show that, compared with the existing solutions llama.cpp and Fiddler, ScheInfer is 1.11 to 1.80 times faster in the decoding phase and 1.69 to 6.33 times faster in the pre-filling phase, for an overall speedup of 1.25 to 2.04 times.
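
The idea behind overlapping CPU computation, GPU computation, and PCIe transfers can be illustrated with a toy cost model. The sketch below is not ScheInfer's algorithm, which this summary does not describe in detail. It assumes identical model units (for example, layers or experts), a fixed number of units already resident in GPU memory, and perfect overlap among the three resources, so the latency is set by whichever resource is busiest. The names `UnitCosts`, `overlapped_time`, and `best_split` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnitCosts:
    """Hypothetical per-unit costs in milliseconds (assumed, not measured)."""
    cpu_ms: float       # time to compute one unit on the CPU
    gpu_ms: float       # time to compute one unit on the GPU
    transfer_ms: float  # time to copy one unit's weights over PCIe


def overlapped_time(n_cpu, n_resident, n_streamed, costs):
    """Latency when CPU, GPU, and PCIe run concurrently.

    Under the perfect-overlap assumption the slowest resource dominates.
    """
    cpu = n_cpu * costs.cpu_ms
    gpu = (n_resident + n_streamed) * costs.gpu_ms
    pcie = n_streamed * costs.transfer_ms
    return max(cpu, gpu, pcie)


def best_split(n_units, n_resident, costs):
    """Brute-force how many non-resident units to compute on the CPU.

    The rest are streamed over PCIe and computed on the GPU.
    Returns (n_cpu, n_streamed, latency_ms).
    """
    remaining = n_units - n_resident
    candidates = (
        (overlapped_time(x, n_resident, remaining - x, costs), x)
        for x in range(remaining + 1)
    )
    time_ms, n_cpu = min(candidates)
    return n_cpu, remaining - n_cpu, time_ms
```

With made-up costs of 4 ms (CPU), 1 ms (GPU), and 6 ms (PCIe) per unit, 10 units, and 2 resident on the GPU, `best_split(10, 2, UnitCosts(4.0, 1.0, 6.0))` picks 5 units for the CPU and 3 to stream, for 20 ms. Running everything non-resident on the CPU would take 32 ms and streaming everything would take 48 ms, which is the kind of gap that using all three resources together is meant to close.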