CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Hongpeng Jin, Yanzhao Wu
2024-11-05
Abstract: Large Language Models (LLMs) have achieved remarkable success in serving end-users with human-like intelligence. However, LLMs demand high computational resources, making it challenging to deploy them to satisfy various performance objectives, such as meeting the resource constraints on edge devices close to end-users or achieving high accuracy with ample resources. In this paper, we introduce CE-CoLLM, a novel cloud-edge collaboration framework that supports efficient and adaptive LLM inference for end-users at the edge with two modes, (1) low-latency edge standalone inference and (2) highly accurate cloud-edge collaborative inference. First, we show that the inherent high communication costs for transmitting LLM contextual information between the edge and cloud dominate the overall latency, making it inefficient and costly to deploy LLMs using cloud-edge collaboration. Second, we propose several critical techniques to address this challenge, including an early-exit mechanism, a cloud context manager, and quantization in cloud-edge collaboration to enable not only low-latency standalone edge inference but also efficient and adaptive cloud-edge collaborative inference for LLMs. Third, we perform comprehensive experimental analysis, which demonstrates that CE-CoLLM significantly reduces inference time by up to 13.81% and cloud computation costs by up to 84.55% compared to the popular cloud-based LLM deployment, while maintaining comparable model accuracy. The proposed approach effectively shifts the computational load to the edge, reduces the communication overhead, scales efficiently with multiple edge clients, and provides reliable LLM deployment using cloud-edge collaboration.
Distributed, Parallel, and Cluster Computing, Machine Learning
What problem does this paper attempt to address?
This paper addresses the following problem: although large language models (LLMs) have achieved remarkable success in serving end-users, their high demand for computational resources makes them difficult to deploy in a way that meets varied performance goals. The main challenges are:

1. **Resource limitations on edge devices**: Edge devices are close to end-users, but their computing, memory, and storage resources are limited, which makes deploying full-size LLMs directly on them difficult.
2. **High latency and privacy issues in cloud-side deployment**: Deploying LLMs entirely in the cloud can lead to high latency, especially during data transmission. In addition, transmitting private user data over the public Internet increases the risk of privacy leakage.
3. **Communication overhead in cloud-edge collaboration**: Existing cloud-edge collaboration methods require frequent exchange of large amounts of intermediate-state data, resulting in high communication overhead and inference latency.

To address these problems, the paper proposes a new cloud-edge collaboration framework, CE-CoLLM (Cloud-Edge Collaborative Large Language Model), which aims to provide efficient and adaptive LLM inference in the following ways:

- **Low-latency edge standalone inference mode**: Allows edge devices to perform LLM inference without relying on the cloud, thereby reducing latency and protecting privacy.
- **Highly accurate cloud-edge collaborative inference mode**: Dynamically decides, based on prediction confidence, whether cloud-side support is required, balancing the accuracy and efficiency of inference.
- **Optimized cloud-edge communication mechanism**: Introduces an early-exit mechanism, a cloud context manager, and quantization to reduce communication overhead and improve inference efficiency.
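The confidence-based decision in the collaborative mode can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `route_token`, the use of the top softmax probability as the confidence measure, and the threshold value `0.8` are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(exit_logits, threshold=0.8):
    """Decide whether the edge keeps its early-exit prediction.

    exit_logits: logits from a hypothetical early-exit head on the edge.
    threshold:   assumed confidence cutoff (not a value from the paper).
    Returns (decision, token_id, confidence), where decision is
    "edge" if the prediction is accepted locally, else "cloud".
    """
    probs = softmax(exit_logits)
    # Index of the most probable token under the early-exit head.
    token_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[token_id]
    decision = "edge" if confidence >= threshold else "cloud"
    return decision, token_id, confidence


# A peaked distribution stays on the edge; a flat one is deferred.
print(route_token([10.0, 0.0, 0.0])[0])
print(route_token([1.0, 1.0, 1.0])[0])
```

In this sketch, setting `threshold=0.0` accepts every edge prediction, which loosely mirrors a standalone edge mode, while a higher threshold defers more tokens to the cloud.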
In summary, the goal of CE-CoLLM is to achieve efficient, fast, accurate, and adaptable LLM inference through cloud-edge collaboration while reducing communication costs and improving system response speed.
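To illustrate how quantization can reduce the communication cost of sending intermediate data from the edge to the cloud, the sketch below compares payload sizes before and after quantization. The text above does not specify the quantization scheme, so symmetric per-tensor int8 quantization is used here purely as an assumed stand-in, and the helper names and sample values are hypothetical.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization of a list of floats.

    Returns (q, scale), where q is an array of signed bytes and
    scale maps each integer back to an approximate float.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the int8 range used here.
    q = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from the quantized payload."""
    return [x * scale for x in q]


# Hypothetical slice of a hidden state to be sent to the cloud.
hidden = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(hidden)
restored = dequantize_int8(q, scale)

fp32_bytes = len(array("f", hidden).tobytes())  # 4 bytes per value
int8_bytes = len(q.tobytes()) + 4               # 1 byte per value + float32 scale
print(fp32_bytes, int8_bytes)
```

For realistic hidden-state sizes the single scale value is negligible, so the payload shrinks to roughly a quarter of its float32 size, at the cost of a per-value rounding error of at most half the scale.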