EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Mingjin Zhang, Jiannong Cao, Xiaoming Shen, Zeyang Cui

2024-05-23

Abstract: Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs rely heavily on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing promises to address these concerns by deploying LLMs on edge devices, closer to data sources. Some works use model quantization to shrink the model to fit resource-constrained edge devices, but this causes accuracy loss. Other works use cloud-edge collaboration, which suffers from unstable network connections. In this work, we leverage collaborative edge computing to let edge devices and cloud servers jointly perform efficient LLM inference. We propose a general framework to partition the LLM into shards and deploy them on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments with Llama2 series models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.

Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

The paper addresses the high latency, high bandwidth cost, and privacy concerns that arise when large language models (LLMs) are served from the cloud. The authors propose a framework called EdgeShard, which uses collaborative edge computing to optimize LLM inference. EdgeShard partitions the LLM into multiple shards and deploys them across edge devices and cloud servers to reduce inference latency and increase throughput. The partitioning takes into account the computational capability and memory constraints of heterogeneous devices, as well as the quality of the network connections between them.

The main contributions of the paper are:

1. A general LLM inference framework that supports collaborative inference across heterogeneous edge devices and cloud servers.
2. A formulation of the joint device selection and model partition problem, together with a dynamic programming algorithm that optimizes latency and throughput respectively.
3. An evaluation of EdgeShard with Llama2 series models on a real, heterogeneous testbed, reporting up to 50% lower latency and 2x higher throughput than baseline methods.

Overall, EdgeShard aims to overcome the drawbacks of cloud-only LLM deployment by making full use of edge computing resources.
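To make the joint device selection and partitioning idea concrete, here is a minimal Python sketch of a latency-minimizing dynamic program. It is not the paper's algorithm; it is a simplified illustration under stated assumptions: each device holds at most one contiguous shard of layers, the first layer is pinned to a source device (an assumption motivated by the privacy concern above), and the cost of returning the output to the source is ignored. The function name `min_latency_partition` and all parameter names are hypothetical.

```python
from functools import lru_cache
from math import inf


def min_latency_partition(compute, layer_mem, dev_mem, act_bytes, bandwidth, source=0):
    """Pick devices and contiguous layer shards that minimize end-to-end latency.

    compute[i][d]   : seconds to run layer i on device d
    layer_mem[i]    : memory needed by layer i
    dev_mem[d]      : memory budget of device d
    act_bytes[i]    : size of the activation produced by layer i
    bandwidth[a][b] : bytes per second from device a to device b
    Returns (latency, shards); each shard is (device, first_layer, end_layer_exclusive).
    """
    n, m = len(layer_mem), len(dev_mem)

    @lru_cache(maxsize=None)
    def best(i, used, last):
        # Minimum latency for layers i..n-1, given the bitmask of devices
        # already used and the device that ran layer i-1.
        if i == n:
            return 0.0, ()
        result = (inf, ())
        for d in range(m):
            if (used >> d) & 1:
                continue  # assumption: one shard per device
            if i == 0 and d != source:
                continue  # assumption: first layer stays on the source device
            transfer = 0.0 if i == 0 else act_bytes[i - 1] / bandwidth[last][d]
            mem, run = 0, 0.0
            for k in range(i, n):  # grow the shard one layer at a time
                mem += layer_mem[k]
                if mem > dev_mem[d]:
                    break  # shard no longer fits on device d
                run += compute[k][d]
                tail_cost, tail = best(k + 1, used | (1 << d), d)
                total = transfer + run + tail_cost
                if total < result[0]:
                    result = (total, ((d, i, k + 1),) + tail)
        return result

    latency, shards = best(0, 0, source)
    return latency, list(shards)


# Toy instance: device 0 is slow with little memory, device 1 is fast.
compute = [[1.0, 0.1]] * 4          # 4 layers, 2 devices
layer_mem = [1, 1, 1, 1]
dev_mem = [2, 4]
act_bytes = [1, 1, 1, 1]
bandwidth = [[inf, 10.0], [10.0, inf]]
latency, shards = min_latency_partition(compute, layer_mem, dev_mem, act_bytes, bandwidth)
print(latency, shards)
```

In this toy instance the first layer stays on device 0 and the remaining three layers run on device 1, because the one-off transfer cost is far smaller than the time saved on the faster device. A throughput-oriented variant would minimize the slowest pipeline stage instead of the sum of stage times.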

Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Edge Collaborative Learning Acceleration Based on Latency Prediction

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach

EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Decentralized LLM Inference over Edge Networks with Energy Harvesting

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

Generative Inference of Large Language Models in Edge Computing: An Energy Efficient Approach

Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

Efficient and Economic Large Language Model Inference with Attention Offloading

Distributed Inference Performance Optimization for LLMs on CPUs

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques

EdgeLD: Locally Distributed Deep Learning Inference on Edge Device Clusters