Abstract: Large Language Models (LLMs) have achieved remarkable success in serving end-users with human-like intelligence. However, LLMs demand substantial computational resources, which makes it challenging to deploy them under differing performance objectives, such as meeting the resource constraints of edge devices close to end-users or achieving high accuracy when ample resources are available. In this paper, we introduce CE-CoLLM, a novel cloud-edge collaboration framework that supports efficient and adaptive LLM inference for end-users at the edge in two modes: (1) low-latency edge standalone inference and (2) highly accurate cloud-edge collaborative inference. First, we show that the inherently high communication cost of transmitting LLM contextual information between the edge and the cloud dominates the overall latency, making it inefficient and costly to deploy LLMs using cloud-edge collaboration. Second, we propose several critical techniques to address this challenge, including an early-exit mechanism, a cloud context manager, and quantization in cloud-edge collaboration, which enable not only low-latency standalone edge inference but also efficient and adaptive cloud-edge collaborative inference for LLMs. Third, we perform a comprehensive experimental analysis, which demonstrates that CE-CoLLM reduces inference time by up to 13.81% and cloud computation costs by up to 84.55% compared to the popular cloud-based LLM deployment, while maintaining comparable model accuracy. The proposed approach effectively shifts the computational load to the edge, reduces the communication overhead, scales efficiently with multiple edge clients, and provides reliable LLM deployment using cloud-edge collaboration.
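The abstract names an early-exit mechanism and two inference modes but does not spell out the exit rule. The following is a minimal sketch of one common way such a decision could be made, assuming the edge model exposes logits at an early-exit layer and that a token is accepted locally when its softmax confidence reaches a threshold. The function names, the `standalone` flag, and the threshold value of 0.8 are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(edge_exit_logits, threshold=0.8, standalone=False):
    """Decide where the next token is produced (illustrative only).

    edge_exit_logits: logits from an early-exit layer on the edge (assumed).
    threshold: hypothetical confidence cutoff for accepting the edge result.
    standalone: if True, always accept the edge prediction (edge-only mode).

    Returns (target, token_id, confidence), where target is "edge" or "cloud".
    token_id is None when the request is deferred to the cloud.
    """
    probs = softmax(edge_exit_logits)
    confidence = max(probs)
    token_id = probs.index(confidence)

    if standalone or confidence >= threshold:
        return ("edge", token_id, confidence)
    # Low confidence: defer to the cloud, which would continue inference
    # from the transmitted context.
    return ("cloud", None, confidence)
```

Under these assumptions, a sharply peaked distribution stays on the edge, while a flat one is deferred to the cloud unless standalone mode is requested. Fewer deferred tokens means less context has to cross the network, which is the cost the abstract identifies as dominant.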

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Efficient Deployment of Large Language Model Across Cloud-Device Systems

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Distributed Inference Performance Optimization for LLMs on CPUs

WebLLM: A High-Performance In-Browser LLM Inference Engine

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

LLMCad: Fast and Scalable On-device Large Language Model Inference

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

Efficient and Economic Large Language Model Inference with Attention Offloading

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Efficient LLM inference solution on Intel GPU

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

A Hardware Evaluation Framework for Large Language Model Inference