Abstract: Large Language Models (LLMs) have achieved remarkable success in serving end-users with human-like intelligence. However, LLMs demand high computational resources, making it challenging to deploy them to satisfy various performance objectives, such as meeting the resource constraints on edge devices close to end-users or achieving high accuracy with ample resources. In this paper, we introduce CE-CoLLM, a novel cloud-edge collaboration framework that supports efficient and adaptive LLM inference for end-users at the edge with two modes: (1) low-latency edge standalone inference and (2) highly accurate cloud-edge collaborative inference. First, we show that the inherent high communication costs for transmitting LLM contextual information between the edge and cloud dominate the overall latency, making it inefficient and costly to deploy LLMs using cloud-edge collaboration. Second, we propose several critical techniques to address this challenge, including an early-exit mechanism, a cloud context manager, and quantization in cloud-edge collaboration, to enable not only low-latency standalone edge inference but also efficient and adaptive cloud-edge collaborative inference for LLMs. Third, we perform comprehensive experimental analysis, which demonstrates that CE-CoLLM significantly reduces inference time by up to 13.81% and cloud computation costs by up to 84.55% compared to the popular cloud-based LLM deployment, while maintaining comparable model accuracy. The proposed approach effectively shifts the computational load to the edge, reduces the communication overhead, scales efficiently with multiple edge clients, and provides reliable LLM deployment using cloud-edge collaboration.

What problem does this paper attempt to address?

Although large language models (LLMs) have achieved remarkable success in serving end-users, their high demand for computational resources makes it difficult to deploy them in ways that meet differing performance goals. The main challenges are:

1. **Resource limitations on edge devices**: Edge devices are close to end-users, but their computing, memory, and storage resources are too limited to host full-size LLMs directly.
2. **High latency and privacy issues in cloud-side deployment**: Deploying LLMs entirely in the cloud can lead to high latency, especially during data transmission. In addition, transmitting private user data over the public Internet increases the risk of privacy leakage.
3. **Communication overhead in cloud-edge collaboration**: Existing cloud-edge collaboration methods require frequent exchange of large amounts of intermediate-state data, resulting in high communication overhead and inference latency.

To address these problems, the paper proposes a new cloud-edge collaboration framework, CE-CoLLM (Cloud-Edge Collaborative Large Language Model), which aims to provide efficient and adaptive LLM inference in the following ways:

- **Low-latency edge standalone inference mode**: Allows edge devices to perform LLM inference without relying on the cloud, thereby reducing latency and protecting privacy.
- **High-accuracy cloud-edge collaborative inference mode**: Dynamically decides whether cloud-side support is required according to the prediction confidence, balancing the accuracy and efficiency of inference.
- **Optimized cloud-edge communication mechanism**: Introduces an early-exit mechanism, a cloud context manager, and quantization techniques to reduce communication overhead and improve inference efficiency.

In summary, the goal of CE-CoLLM is to achieve efficient, fast, accurate, and adaptive LLM inference through cloud-edge collaboration, while reducing communication costs and improving system response speed.
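
The confidence-based choice between edge standalone and cloud-assisted inference can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not say how confidence is computed, so the maximum softmax probability at an early-exit point is used here as an assumption, and the function names and the `threshold` value of 0.8 are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_decision(
    logits: List[float], threshold: float = 0.8
) -> Tuple[int, float, bool]:
    """Decide whether the edge can accept its own early-exit prediction.

    Assumption: confidence is the top softmax probability of the logits
    produced at an early-exit layer on the edge device. `threshold` is a
    hypothetical tuning knob, not a value taken from the paper.

    Returns (token_id, confidence, needs_cloud).
    """
    probs = softmax(logits)
    token_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[token_id]
    # Below the threshold, the edge would defer to the cloud for this token.
    needs_cloud = confidence < threshold
    return token_id, confidence, needs_cloud


if __name__ == "__main__":
    # A peaked distribution: the edge keeps its own prediction.
    print(early_exit_decision([5.0, 0.0, 0.0]))
    # A flat distribution: the edge requests cloud support.
    print(early_exit_decision([1.0, 1.0, 1.0]))
```

Under this sketch, a higher threshold sends more tokens to the cloud (favoring accuracy), while a lower one keeps more tokens on the edge (favoring latency and lower cloud cost).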

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

LLM-Cloud Complete: Leveraging Cloud Computing for Efficient Large Language Model-based Code Completion

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

Efficient and Economic Large Language Model Inference with Attention Offloading

ELMS: Elasticized Large Language Models On Mobile Devices

FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly