Abstract: The capabilities of large language models (LLMs) in text comprehension and generation are advancing artificial intelligence. However, their growing parameter counts and computational demands make it difficult to deploy inference services efficiently. High-performance GPU clusters in the cloud can meet these demands, but they incur high service costs and are exposed to network instability, which makes it hard to meet service-level agreements (SLAs). The “cloud-device collaboration” approach leverages the heterogeneous hardware on both the cloud and device sides to satisfy SLAs efficiently. However, operational intensity varies across LLM operators and changes dynamically, which complicates load scheduling for cloud-device systems. To address these challenges, we optimize LLM inference deployment on cloud-device systems in three aspects: scheduling algorithm, hardware modeling, and compilation and deployment. For the scheduling algorithm, we analyze the LLM computation network, evaluate the computation-to-memory-access ratio under different sequence lengths, and propose a greedy operator-level scheduling strategy. For hardware modeling, we establish a relationship between operational intensity and GPU resource utilization to estimate operator running time. Finally, we design a cloud-device LLM compiler framework for quantitative evaluation and efficient deployment across various hardware combinations and inference tasks. In specific inference scenarios, our framework satisfies the inference latency requirements and achieves an average cost reduction of $20.7\%$ compared to cloud-side-only inference.
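To make the scheduling idea concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes a plain roofline estimate (attainable throughput is the smaller of peak compute and operational intensity times memory bandwidth) as a stand-in for the paper's utilization-based hardware model. It also assumes operators run serially and that cloud-device transfer time is ignored. The names `Device`, `Operator`, `estimate_time`, and `greedy_schedule`, as well as all hardware and price numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    peak_flops: float       # peak compute, FLOP/s (illustrative value)
    mem_bandwidth: float    # memory bandwidth, bytes/s (illustrative value)
    cost_per_second: float  # hypothetical price per second of use


@dataclass
class Operator:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def operational_intensity(op: Operator) -> float:
    """Computation-to-memory-access ratio in FLOP per byte."""
    return op.flops / op.bytes_moved


def estimate_time(op: Operator, dev: Device) -> float:
    """Roofline-style running-time estimate (an assumed stand-in model)."""
    attainable = min(dev.peak_flops,
                     operational_intensity(op) * dev.mem_bandwidth)
    return op.flops / attainable


def greedy_schedule(ops, cloud, device, latency_budget):
    """Start cloud-only, then greedily move operators to the device.

    Operators are ranked by cost saved per second of added latency and
    moved while the total serial latency stays within the budget.
    """
    placement = {op.name: cloud.name for op in ops}
    latency = sum(estimate_time(op, cloud) for op in ops)
    candidates = []
    for op in ops:
        t_cloud = estimate_time(op, cloud)
        t_dev = estimate_time(op, device)
        saved = t_cloud * cloud.cost_per_second - t_dev * device.cost_per_second
        extra = t_dev - t_cloud
        if saved > 0:  # only consider moves that actually reduce cost
            candidates.append((saved / max(extra, 1e-12), extra, op))
    candidates.sort(key=lambda c: c[0], reverse=True)
    for _, extra, op in candidates:
        if latency + extra <= latency_budget:
            placement[op.name] = device.name
            latency += extra
    return placement, latency


# Illustrative usage with made-up hardware and operator numbers.
cloud = Device("cloud", peak_flops=100e12, mem_bandwidth=1e12, cost_per_second=1.0)
device = Device("device", peak_flops=10e12, mem_bandwidth=0.5e12, cost_per_second=0.2)
ops = [
    Operator("prefill_matmul", flops=2e12, bytes_moved=1e10),    # compute-bound
    Operator("decode_attention", flops=1e10, bytes_moved=1e10),  # memory-bound
]
placement, latency = greedy_schedule(ops, cloud, device, latency_budget=0.05)
print(placement, latency)
```

Under these assumed numbers, the memory-bound operator loses little speed on the weaker device and is the one moved off the cloud. The compute-bound operator stays on the cloud because moving it would both raise cost and exceed the latency budget.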

Efficient Deployment of Large Language Model Across Cloud-Device Systems
