Abstract: The capabilities of large language models (LLMs) in text comprehension and generation are advancing artificial intelligence. However, their growing parameter counts and computational demands make it difficult to deploy inference services efficiently. High-performance GPU clusters in the cloud can meet these demands, but they incur high service costs and suffer from network instability, which makes it hard to meet service-level agreements (SLAs). The “cloud-device collaboration” approach leverages the heterogeneous hardware on both the cloud and device sides to satisfy SLAs efficiently. However, operational intensity varies across LLM operators and changes dynamically during inference, which complicates load scheduling for cloud-device systems. To address these challenges, we optimize LLM inference deployment on cloud-device systems in three aspects: scheduling algorithm, hardware modeling, and compilation deployment. For the scheduling algorithm, we analyze the LLM computation network, evaluate the computation-to-memory access ratio under different sequence lengths, and propose an operator-level scheduling strategy based on a greedy algorithm. For the hardware modeling, we establish a relationship between operational intensity and GPU resource utilization to estimate operator running time. Finally, we design a cloud-device LLM compiler framework for quantitative evaluation and efficient deployment across various hardware combinations and inference tasks. In specific inference scenarios, our framework meets the inference latency requirements and achieves an average cost reduction of $20.7\%$ compared to cloud-side-only inference.
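To make the two central ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a plain roofline model (operator time is the larger of compute time and memory time) in place of the paper's utilization-based hardware model, and a simple greedy rule that moves operators from the device side to the cloud side until a latency budget is met. All names (`Device`, `Operator`, `estimate_time`, `greedy_schedule`) and all numbers are hypothetical. Operators are assumed to run sequentially, and cloud-device transfer time is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float       # peak compute throughput, FLOP/s
    mem_bandwidth: float    # peak memory bandwidth, bytes/s
    cost_per_second: float  # hypothetical cost units per second of use


@dataclass(frozen=True)
class Operator:
    name: str           # assumed unique within a model
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory


def operational_intensity(op: Operator) -> float:
    """Computation-to-memory-access ratio, in FLOP per byte."""
    return op.flops / op.bytes_moved


def estimate_time(op: Operator, dev: Device) -> float:
    """Roofline estimate: the slower of the compute and memory bounds."""
    return max(op.flops / dev.peak_flops, op.bytes_moved / dev.mem_bandwidth)


def greedy_schedule(ops, cloud: Device, device: Device, latency_budget: float):
    """Place every operator on the device side, then greedily offload.

    Operators that run faster in the cloud are ranked by time saved per
    unit of extra cost and moved one by one until the estimated latency
    fits the budget. Returns (placement, estimated_latency); the budget
    may still be exceeded if offloading everything is not enough.
    """
    placement = {op.name: device.name for op in ops}
    total = sum(estimate_time(op, device) for op in ops)

    def gain(op: Operator) -> float:
        t_dev, t_cloud = estimate_time(op, device), estimate_time(op, cloud)
        extra_cost = t_cloud * cloud.cost_per_second - t_dev * device.cost_per_second
        # Offloads that add no cost get a very large gain and go first.
        return (t_dev - t_cloud) / max(extra_cost, 1e-12)

    candidates = [op for op in ops
                  if estimate_time(op, cloud) < estimate_time(op, device)]
    for op in sorted(candidates, key=gain, reverse=True):
        if total <= latency_budget:
            break
        total -= estimate_time(op, device) - estimate_time(op, cloud)
        placement[op.name] = cloud.name
    return placement, total
```

Under this sketch, a compute-heavy operator such as a large matrix multiplication tends to be offloaded first, while a memory-bound operator with low operational intensity gains little from the cloud GPU and stays on the device side. A real system would also need to account for transfer latency and for the change in operational intensity with sequence length.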

Poster: PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Efficient Deployment of Large Language Model Across Cloud-Device Systems

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Efficient and Economic Large Language Model Inference with Attention Offloading

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Understanding LLMs: A Comprehensive Overview from Training to Inference

Distributed Inference Performance Optimization for LLMs on CPUs

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

LLM Inference Unveiled: Survey and Roofline Model Insights

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

LLMs as On-demand Customizable Service

Demystifying Platform Requirements for Diverse LLM Inference Use Cases

PermLLM: Private Inference of Large Language Models within 3 Seconds under WAN