Efficient Deployment of Large Language Model Across Cloud-Device Systems

Fan Yang, Zehao Wang, Haoyu Zhang, Zhenhua Zhu, Xinhao Yang, Guohao Dai, Yu Wang
DOI: https://doi.org/10.1109/socc62300.2024.10737825
2024-01-01
Abstract: The capabilities of large language models (LLMs) in text comprehension and generation are advancing artificial intelligence. However, their growing parameter counts and computational demands make it hard to deploy inference services efficiently. High-performance GPU clusters in the cloud can meet these demands, but they incur high service costs and suffer from network instability, which makes it difficult to meet service-level agreements (SLAs). The "cloud-device collaboration" approach leverages the heterogeneous hardware on both the cloud and device sides to satisfy SLAs efficiently. However, operational intensity varies across LLM operators and changes dynamically during inference, which complicates load scheduling for cloud-device systems. To address these challenges, we optimize LLM inference deployment on cloud-device systems in three aspects: scheduling algorithm, hardware modeling, and compilation deployment. For the scheduling algorithm, we analyze the LLM computation network, evaluate the computation-to-memory-access ratio under different sequence lengths, and propose a greedy operator-level scheduling strategy. For the hardware modeling, we establish a relationship between operational intensity and GPU resource utilization to estimate operator running time. Finally, we design a cloud-device LLM compiler framework for quantitative evaluation and efficient deployment across various hardware combinations and inference tasks. In specific inference scenarios, our framework meets the inference latency requirements and achieves an average cost reduction of 20.7% compared to cloud-side-only inference.
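The abstract does not give the scheduling algorithm or the hardware model in detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a plain roofline model (running time is the larger of compute time and memory time) in place of the paper's utilization-based estimate, ignores cloud-device transfer time, and uses a simple greedy rule: start with every operator on the cloud, then move operators to the device in order of cost saved per unit of added latency while a latency budget holds. All names (`Hardware`, `Operator`, `estimate_time`, `greedy_schedule`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    name: str
    peak_flops: float       # peak compute throughput, FLOP/s (assumed value)
    mem_bandwidth: float    # peak memory bandwidth, bytes/s (assumed value)
    cost_per_second: float  # arbitrary cost units per second of use


@dataclass(frozen=True)
class Operator:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def operational_intensity(op: Operator) -> float:
    """Computation-to-memory-access ratio in FLOP per byte."""
    return op.flops / op.bytes_moved


def estimate_time(op: Operator, hw: Hardware) -> float:
    """Roofline estimate: bounded by compute or by memory traffic."""
    return max(op.flops / hw.peak_flops, op.bytes_moved / hw.mem_bandwidth)


def greedy_schedule(ops, cloud, device, latency_budget):
    """Greedily offload operators to the device within a latency budget.

    Transfer time between cloud and device is ignored (a simplification).
    Returns (placement dict, estimated total latency in seconds).
    """
    placement = {op.name: cloud.name for op in ops}
    latency = sum(estimate_time(op, cloud) for op in ops)

    candidates = []
    for op in ops:
        t_cloud = estimate_time(op, cloud)
        t_device = estimate_time(op, device)
        saving = t_cloud * cloud.cost_per_second - t_device * device.cost_per_second
        extra = t_device - t_cloud  # added latency if moved to the device
        if saving > 0:
            candidates.append((saving / max(extra, 1e-12), extra, op))

    # Best cost saving per second of added latency first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    for _, extra, op in candidates:
        if latency + extra <= latency_budget:
            placement[op.name] = device.name
            latency += extra
    return placement, latency
```

Under this sketch, a memory-bound operator (low operational intensity, such as attention during decoding) tends to be offloaded first, because the cloud GPU's compute advantage matters little for it, while a compute-bound matrix multiplication tends to stay on the cloud.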