Abstract: 6G's AI-native vision of embedding advanced intelligence in the network while bringing it closer to the user requires a systematic evaluation of Generative AI (GenAI) models on edge devices. Rapidly emerging solutions based on Open RAN (ORAN) and Network-in-a-Box strongly advocate the use of low-cost, off-the-shelf components for simpler and more efficient deployment, e.g., in provisioning rural connectivity. In this context, the conceptual architecture, hardware testbeds and precise performance quantification of Large Language Models (LLMs) on off-the-shelf edge devices remain largely unexplored. This research investigates computationally demanding LLM inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN. We investigate various LLMs, including small, medium and large models, on a Raspberry Pi 5 cluster using a lightweight Kubernetes distribution (K3s) with a modular prompting implementation. We study its feasibility and limitations by analyzing throughput, latency, accuracy and efficiency. Our findings indicate that CPU-only deployment of lightweight models, such as Yi, Phi, and Llama3, can effectively support edge applications, achieving a generation throughput of 5 to 12 tokens per second with less than 50% CPU and RAM usage. We conclude that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) can be deployed on edge devices, especially in resource-constrained environments. The research focuses on the following aspects:

1. **Conceptual Architecture and Hardware Testbed**: The paper proposes a conceptual architecture based on a low-cost Raspberry Pi cluster and uses a lightweight Kubernetes distribution (K3s) to deploy LLM inference in edge environments.
2. **Performance Quantification**: The research evaluates LLMs of different sizes on edge devices by analyzing throughput, latency, accuracy, and efficiency. In particular, it examines CPU-only deployment of lightweight models (such as Yi, Phi, and Llama3), which achieve a generation throughput of 5 to 12 tokens per second on resource-constrained edge devices while keeping CPU and RAM usage below 50%.
3. **Feasibility and Limitations**: The paper discusses the feasibility and limitations of deploying LLMs on edge devices, especially for low-bandwidth or remote environments in 6G networks. It finds that lightweight models can provide localized inference in these environments without relying on cloud infrastructure.
4. **Practical Applications**: The research also explores practical applications of LLMs on edge devices, such as natural language processing (NLP), real-time translation, personalized assistance, and network performance optimization through dynamic traffic prediction and resource management.

Overall, the paper aims to provide guidance and a reference point for the future deployment and optimization of LLMs on resource-constrained edge devices through systematic evaluation and experimentation.
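To make the throughput and latency metrics above concrete, the following is a minimal sketch of how generation throughput (tokens per second) and end-to-end latency could be derived from one timed inference run. The paper's measurement code is not shown in this summary, so the `GenerationSample` fields, the function names, and the example numbers are all hypothetical; only the 5 to 12 tokens-per-second range comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class GenerationSample:
    """One timed inference run (hypothetical record layout, not from the paper)."""
    prompt_tokens: int          # tokens in the input prompt
    generated_tokens: int       # tokens produced by the model
    prompt_eval_seconds: float  # time spent processing the prompt
    generation_seconds: float   # time spent producing the output tokens


def generation_throughput(sample: GenerationSample) -> float:
    """Generated tokens per second, excluding prompt processing time."""
    if sample.generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return sample.generated_tokens / sample.generation_seconds


def end_to_end_latency(sample: GenerationSample) -> float:
    """Total wall-clock time for the request, in seconds."""
    return sample.prompt_eval_seconds + sample.generation_seconds


def within_reported_range(tokens_per_second: float,
                          low: float = 5.0, high: float = 12.0) -> bool:
    """Check a measurement against the 5 to 12 tokens/s range in the abstract."""
    return low <= tokens_per_second <= high


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the paper.
    sample = GenerationSample(prompt_tokens=32, generated_tokens=120,
                              prompt_eval_seconds=2.0, generation_seconds=15.0)
    tps = generation_throughput(sample)
    print(f"throughput: {tps:.1f} tokens/s")
    print(f"latency:    {end_to_end_latency(sample):.1f} s")
    print(f"in reported range: {within_reported_range(tps)}")
```

Separating prompt evaluation time from generation time is a design choice in this sketch: it keeps the throughput figure comparable across prompts of different lengths, while the latency figure still reflects what a user would wait for.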

Generative AI on the Edge: Architecture and Performance Evaluation

Toward Democratized Generative AI in Next-Generation Mobile Edge Networks

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

Generative AI as a Service in 6G Edge-Cloud: Generation Task Offloading by In-context Learning

Enabling Distributed Generative Artificial Intelligence in 6G: Mobile Edge Generation

Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

From Cloud to Edge: Rethinking Generative AI for Low-Resource Design Challenges

Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities

NetGPT: An AI-Native Network Architecture for Provisioning Beyond Personalized Generative Services

An Overview on Generative AI at Scale with Edge-Cloud Computing

Large Language Models Empowered Autonomous Edge AI for Connected Intelligence

CLAN: Continuous Learning using Asynchronous Neuroevolution on Commodity Edge Devices

Large Generative AI Models meet Open Networks for 6G: Integration, Platform, and Monetization

Multi-Agent RL-Based Industrial AIGC Service Offloading over Wireless Edge Networks

Mobile Edge Generation: A New Era to 6G

An Edge-Cloud Collaboration Framework for Generative AI Service Provision with Synergetic Big Cloud Model and Small Edge Models
