Generative AI on the Edge: Architecture and Performance Evaluation

Zeinab Nezami, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi
2024-11-19
Abstract: 6G's AI-native vision of embedding advanced intelligence in the network while bringing it closer to the user requires a systematic evaluation of Generative AI (GenAI) models on edge devices. Rapidly emerging solutions based on Open RAN (ORAN) and Network-in-a-Box strongly advocate the use of low-cost, off-the-shelf components for simpler and more efficient deployment, e.g., in provisioning rural connectivity. In this context, conceptual architecture, hardware testbeds and precise performance quantification of Large Language Models (LLMs) on off-the-shelf edge devices remain largely unexplored. This research investigates computationally demanding LLM inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN. We investigate various LLMs, including small, medium and large models, on a Raspberry Pi 5 cluster using a lightweight Kubernetes distribution (K3s) with a modular prompting implementation. We study its feasibility and limitations by analyzing throughput, latency, accuracy and efficiency. Our findings indicate that CPU-only deployment of lightweight models, such as Yi, Phi, and Llama3, can effectively support edge applications, achieving a generation throughput of 5 to 12 tokens per second with less than 50% CPU and RAM usage. We conclude that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Networking and Internet Architecture; Performance
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform when deployed on edge devices, especially in resource-constrained environments. The research focuses on the following aspects:

1. **Conceptual Architecture and Hardware Testbed**: The paper proposes a conceptual architecture based on a low-cost Raspberry Pi cluster and uses a lightweight Kubernetes distribution (K3s) to deploy LLM inference in edge environments.
2. **Performance Quantification**: The research evaluates LLMs of different sizes on edge devices by analyzing throughput, latency, accuracy, and efficiency. In particular, it examines CPU-only deployment of lightweight models (such as Yi, Phi, and Llama3), which achieve a generation throughput of 5 to 12 tokens per second with CPU and RAM usage below 50%.
3. **Feasibility and Limitations**: The paper discusses the feasibility and limitations of deploying LLMs on edge devices, especially for low-bandwidth or remote environments in 6G networks. It finds that lightweight models can provide localized inference in these environments without relying on cloud infrastructure.
4. **Practical Applications**: The research also explores practical applications of LLMs on edge devices, such as natural language processing (NLP), real-time translation, personalized assistance, and network optimization through dynamic traffic prediction and resource management.

Overall, the paper aims to guide the future deployment and optimization of LLMs on resource-constrained edge devices through systematic evaluation and experimentation.
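To make the throughput metric concrete, the following is a minimal sketch of how generation throughput in tokens per second could be computed from per-request measurements. It is not the authors' measurement harness, which this summary does not describe: the `InferenceRun` record, its field names, and the helper functions are all hypothetical, and the sample values in the usage note are invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class InferenceRun:
    """One inference request (hypothetical record; field names are illustrative)."""
    generated_tokens: int      # number of tokens the model produced
    generation_seconds: float  # wall-clock time spent producing them


def tokens_per_second(run: InferenceRun) -> float:
    """Generation throughput of a single run."""
    if run.generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return run.generated_tokens / run.generation_seconds


def mean_throughput(runs: list[InferenceRun]) -> float:
    """Average of the per-run throughputs across several runs."""
    if not runs:
        raise ValueError("at least one run is required")
    return mean(tokens_per_second(r) for r in runs)


def within_range(tps: float, low: float = 5.0, high: float = 12.0) -> bool:
    """Check a throughput against a range; defaults mirror the 5 to 12 tokens/s
    figure quoted in the abstract."""
    return low <= tps <= high
```

For example, with two made-up runs `InferenceRun(120, 12.0)` and `InferenceRun(90, 10.0)`, the per-run throughputs are 10 and 9 tokens per second, and `mean_throughput` averages them. Averaging per-run rates is one design choice; dividing total tokens by total time is an equally valid alternative that weights longer runs more heavily.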