RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

Wei Zhang, Nan Guan, Chun Jason Xue, Juxin Niu
DOI: https://doi.org/10.1109/RTCSA62462.2024.00013
2024-08-21
Abstract: While large language models (LLMs) are usually deployed on powerful servers, there is growing interest in deploying them on local machines for better real-time performance, service stability, privacy, and flexibility. Unfortunately, the GPU memory on local machines is often insufficient to accommodate the entire LLM. Although running an LLM on such a GPU device is still possible by swapping data between the limited GPU memory and the abundant main memory, the slow speed of data swapping significantly increases inference time, rendering this approach impractical. In this paper, we propose RTiL, a systematic solution to address this challenge. RTiL utilizes collaborative inference, which combines a lightweight LLM with the default powerful LLM. The lightweight LLM generates output tokens, which are then validated for quality by the powerful LLM. This approach allows RTiL to significantly speed up inference while maintaining the same output quality as when using the powerful LLM alone. Additionally, by delegating part of the inference workload to the CPU and optimizing data movement between main and GPU memory, we further enhance the efficiency of the inference process. Furthermore, we extend RTiL to handle requests with real-time requirements, enabling it to meet such demands by slightly trading off output quality. Through extensive experiments, we demonstrate notable improvements in inference efficiency and the ability to fulfill real-time requirements while minimizing degradation in output quality.
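The collaborative inference idea in the abstract (a lightweight model proposes tokens, the powerful model validates them) can be illustrated with a generic draft-then-verify loop. The sketch below is not the RTiL algorithm, whose validation rule is not specified in the abstract; it assumes greedy decoding and exact-match validation, and `target_model`, `draft_model`, and `collaborative_generate` are hypothetical toy stand-ins rather than real LLMs or any library's API.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its next token (greedy).
NextToken = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Toy 'powerful' model: a fixed deterministic rule over the last token."""
    return (prefix[-1] * 3 + 1) % 7


def draft_model(prefix: List[int]) -> int:
    """Toy 'lightweight' model: agrees with the target except when the
    prefix length is a multiple of 3, where it deliberately guesses wrong."""
    tok = target_model(prefix)
    return tok if len(prefix) % 3 else (tok + 1) % 7


def collaborative_generate(draft: NextToken, target: NextToken,
                           prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft up to k tokens with the small model, then keep only what the
    large model agrees with (assumed rule: exact greedy match)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The lightweight model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The powerful model checks each proposed position in order.
        #    (A real system would score all k positions in one batched pass;
        #    this sketch calls the target once per position for clarity.)
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # equals tok whenever the draft was right
            if tok != expected:
                break  # 3. First disagreement: discard the rest and re-draft.
    return out[len(prompt):]


if __name__ == "__main__":
    print(collaborative_generate(draft_model, target_model, [2], max_new=10))
```

Under these assumptions the output is identical to running the powerful model alone, which mirrors the abstract's claim of unchanged output quality. Any speedup in practice would come from verifying several drafted tokens per invocation of the powerful model, which this per-position toy loop does not attempt to model.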
Computer Science