EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Mingjin Zhang, Jiannong Cao, Xiaoming Shen, Zeyang Cui
2024-05-23
Abstract: Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is a promising way to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works leverage model quantization to shrink the model so that it fits on resource-constrained edge devices, but this leads to accuracy loss. Other works use cloud-edge collaboration, which suffers from unstable network connections. In this work, we leverage collaborative edge computing to facilitate the collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM into shards and deploy them on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments with Llama2 series models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper aims to address the high latency, high bandwidth cost, and privacy concerns that arise when large language models (LLMs) are served from the cloud. Specifically, the authors propose a framework called EdgeShard, which leverages collaborative edge computing to optimize LLM inference. By partitioning the LLM into multiple shards and deploying them across different edge devices and cloud servers, EdgeShard reduces inference latency and increases throughput. The approach takes into account the computational capabilities and memory constraints of heterogeneous devices, as well as the quality of the network connections between them. The main contributions of the paper include:
1. Proposing a general LLM inference framework that supports collaborative inference between heterogeneous edge devices and cloud servers.
2. Formulating the joint problem of which computing devices to select and how to partition the LLM, and proposing a dynamic programming algorithm to optimize latency and throughput, respectively.
3. Evaluating EdgeShard with Llama2 series models on a real heterogeneous testbed, showing up to 50% lower latency and 2x higher throughput than baseline methods.
Overall, EdgeShard aims to address the challenges of cloud-only LLM deployment by making fuller use of edge computing resources.
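To make the idea of jointly selecting devices and partitioning layers more concrete, the following is a minimal illustrative sketch of a latency-oriented dynamic program. It is not the algorithm from the paper: it assumes the first layer is pinned to the source device, ignores per-device memory limits and the cost of returning the result, and every name in it (`min_latency_partition`, `comp`, `act_bytes`, `bandwidth`) is hypothetical.

```python
def min_latency_partition(comp, act_bytes, bandwidth, source=0):
    """Assign each layer to a device so that sequential latency is minimal.

    comp[i][j]      -- assumed time to run layer i on device j
    act_bytes[i]    -- assumed size of the activations produced by layer i
    bandwidth[k][j] -- assumed link bandwidth from device k to device j
    source          -- device that holds the input and must run layer 0

    Simplifications (not from the paper): no memory constraints, and the
    final output is not sent back to the source device.
    """
    n_layers, n_dev = len(comp), len(comp[0])
    inf = float("inf")
    # dp[i][j]: best latency to finish layers 0..i with layer i on device j
    dp = [[inf] * n_dev for _ in range(n_layers)]
    prev = [[-1] * n_dev for _ in range(n_layers)]
    dp[0][source] = comp[0][source]

    for i in range(1, n_layers):
        for j in range(n_dev):
            for k in range(n_dev):
                if dp[i - 1][k] == inf:
                    continue
                # Moving activations costs time only when the device changes.
                comm = 0.0 if k == j else act_bytes[i - 1] / bandwidth[k][j]
                cand = dp[i - 1][k] + comm + comp[i][j]
                if cand < dp[i][j]:
                    dp[i][j] = cand
                    prev[i][j] = k

    # Backtrack from the best device for the last layer.
    last = min(range(n_dev), key=lambda j: dp[-1][j])
    assignment = [last]
    for i in range(n_layers - 1, 0, -1):
        assignment.append(prev[i][assignment[-1]])
    assignment.reverse()
    return dp[-1][last], assignment


# Toy setting: device 0 is a slow source device, device 1 is a fast server.
comp = [[1.0, 0.2]] * 4
act_bytes = [1.0] * 4
fast_link = [[1.0, 2.0], [2.0, 1.0]]
slow_link = [[1.0, 0.1], [0.1, 1.0]]
print(min_latency_partition(comp, act_bytes, fast_link))
print(min_latency_partition(comp, act_bytes, slow_link))
```

In this toy setting, a fast link makes it worthwhile to offload every layer after the first to the faster device, while a slow link keeps all layers on the source device. That is the kind of trade-off between compute capability and network quality the summary describes.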