FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning

Fenglong Cai, Dong Yuan, Zhe Yang, Yonghui Xu, Wei He, Wei Guo, Lizhen Cui
DOI: https://doi.org/10.1016/j.parco.2024.103114
IF: 0.983
2024-10-12
Parallel Computing
Abstract: Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.
computer science, theory & methods