Abstract: The advent of large language models (LLMs) revolutionized natural language processing applications, and running LLMs on edge devices has become increasingly attractive for reasons including reduced latency, data localization, and personalized user experiences. This comprehensive review examines the challenges of deploying computationally expensive LLMs on resource-constrained devices and explores innovative solutions across multiple domains. The paper investigates the development of on-device language models, their efficient architectures, including parameter sharing and modular designs, as well as state-of-the-art compression techniques like quantization, pruning, and knowledge distillation. Hardware acceleration strategies and collaborative edge-cloud deployment approaches are analyzed, highlighting the intricate balance between performance and resource utilization. Case studies of on-device language models from major mobile manufacturers demonstrate real-world applications and potential benefits. The review also addresses critical aspects such as adaptive learning, multi-modal capabilities, and personalization. By identifying key research directions and open challenges, this paper provides a roadmap for future advancements in on-device language models, emphasizing the need for interdisciplinary efforts to realize the full potential of ubiquitous, intelligent computing while ensuring responsible and ethical deployment. For a comprehensive review of research work and educational resources on on-device large language models (LLMs), please visit <a class="link-external link-https" href="https://github.com/NexaAI/Awesome-LLMs-on-device" rel="external noopener nofollow">this https URL</a>. To download and run on-device LLMs, visit <a class="link-external link-https" href="https://www.nexaai.com/models" rel="external noopener nofollow">this https URL</a>.

SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation

Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions

LLMCad: Fast and Scalable On-device Large Language Model Inference

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis

Cloud-Device Collaborative Learning for Multimodal Large Language Models

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

ELMS: Elasticized Large Language Models On Mobile Devices

Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Enabling On-Device LLMs Personalization with Smartphone Sensing

On-Device Language Models: A Comprehensive Review

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

On-device query intent prediction with lightweight LLMs to support ubiquitous conversations

Extremely Low Footprint End-to-End ASR System for Smart Device

Adaptive Pruning for Large Language Models with Structural Importance Awareness

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding