Abstract: Being more powerful and intrusive into user-device interactions, LLMs are eager for on-device execution to better preserve user privacy. In this work, we propose a new paradigm of mobile AI: LLM as a system service on mobile devices (LLMaaS). Unlike traditional DNNs that execute in a stateless manner, such a system service is stateful: LLM execution often needs to maintain persistent states (mainly the KV cache) across multiple invocations. To minimize the LLM context switching overhead under a tight device memory budget, this work presents LLMS, which decouples the memory management of app and LLM contexts with a key idea of fine-grained, chunk-wise, globally-optimized KV cache compression and swapping. By fully leveraging the KV cache's unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: it compresses chunks based on their measured accuracy tolerance to compression. (2) IO-Recompute Pipelined Loading: it introduces recompute to swapping-in for acceleration. (3) Chunk Lifecycle Management: it optimizes the memory activities of chunks with ahead-of-time swapping-out and an LCTRU (Least Compression-Tolerable and Recently-Used) queue based eviction. In evaluations conducted on well-established traces and various edge devices, LLMS reduces context switching latency by up to 2 orders of magnitude when compared to competitive baseline solutions.

What problem does this paper attempt to address?

The paper addresses the main challenges that large language models (LLMs) face when running as a system service on mobile devices, in particular how to manage LLM context (mainly the KV cache) efficiently. Specifically:

1. **Proposing a new paradigm, LLMaaS (LLM as a system service)**: The paper proposes exposing the LLM and its inference infrastructure to applications as a service of the mobile operating system. This differs from the traditional approach, in which each application ships its own model and the operating system is unaware of it. The new paradigm enables more intelligent and personalized assistant functions and better protects user privacy.
2. **Addressing LLM context management**: Unlike traditional stateless DNN execution, an LLM must maintain persistent state (mainly the KV cache) across multiple calls, which places new requirements on memory management. The paper identifies this system challenge and designs a system called LLMS to minimize the overhead of LLM context switching under a limited memory budget.
3. **Proposing three key techniques**:
   - **Tolerance-Aware Compression**: Compressing each chunk to a different extent based on its information density, so that as much information as possible is retained overall while the configured global average compression ratio is met.
   - **Swapping-Recompute Pipeline** (called IO-Recompute Pipelined Loading in the abstract): Accelerating context switching by recomputing some chunks from the original text and overlapping that recomputation with I/O time.
   - **Chunk Lifecycle Management**: Using a Least Compression-Tolerable and Recently-Used (LCTRU) queue to determine when and which chunks should be swapped out, thereby optimizing the quality of context switching.
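The idea behind Tolerance-Aware Compression, giving each chunk its own compression level while respecting a global average, can be illustrated with a small sketch. This is not the paper's algorithm: the greedy policy, the function name `assign_bits`, the candidate bit widths, and the use of a single precomputed density score per chunk are all assumptions made for illustration, and chunks are assumed to be of equal size.

```python
def assign_bits(densities, levels=(2, 4, 8), avg_bits_budget=4.0):
    """Pick a quantization width per chunk under a global average-bits budget.

    densities: hypothetical per-chunk information-density scores (higher = denser).
    levels: hypothetical candidate bit widths.
    avg_bits_budget: the global average bit width that must not be exceeded.
    """
    n = len(densities)
    levels = sorted(levels)
    bits = [levels[0]] * n              # start every chunk at the heaviest compression
    total_budget = avg_bits_budget * n  # total bits available across all chunks
    used = levels[0] * n

    # Visit chunks from most to least dense and raise each one to the
    # widest level that still fits in the remaining budget.
    for i in sorted(range(n), key=lambda i: densities[i], reverse=True):
        for lv in reversed(levels):
            extra = lv - bits[i]
            if extra > 0 and used + extra <= total_budget:
                used += extra
                bits[i] = lv
                break
    return bits


if __name__ == "__main__":
    print(assign_bits([0.9, 0.1, 0.5, 0.3]))  # [8, 2, 4, 2]
```

With a budget of 4 bits on average, the densest chunk keeps 8 bits, the next densest gets 4, and the remaining two stay at 2, so the average is exactly 4.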
Through these technologies, LLMS can significantly reduce context switching latency on commercial devices while maintaining high accuracy, thus promoting the development of LLM services on mobile devices.
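To make the eviction policy in Chunk Lifecycle Management more concrete, here is a minimal sketch of one plausible reading of an LCTRU-style ordering. The summary does not spell out the exact ordering rule, so the choice below is an assumption: chunks that were already compressed most heavily (and so tolerate compression best) are swapped out first, with ties broken by least recent use. The `Chunk` fields and the `pick_victims` helper are hypothetical names, not part of LLMS.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    bits: int        # quantization width; fewer bits = compressed more heavily
    last_used: int   # logical timestamp of the most recent access
    size_bytes: int


def pick_victims(chunks, bytes_needed):
    """Return the chunks to swap out until at least bytes_needed are freed.

    Assumed ordering: most heavily compressed first, then least recently used.
    """
    order = sorted(chunks, key=lambda c: (c.bits, c.last_used))
    victims, freed = [], 0
    for c in order:
        if freed >= bytes_needed:
            break
        victims.append(c)
        freed += c.size_bytes
    return victims


if __name__ == "__main__":
    resident = [
        Chunk(0, bits=8, last_used=1, size_bytes=100),
        Chunk(1, bits=2, last_used=5, size_bytes=100),
        Chunk(2, bits=2, last_used=3, size_bytes=100),
        Chunk(3, bits=4, last_used=0, size_bytes=100),
    ]
    print([c.chunk_id for c in pick_victims(resident, 150)])  # [2, 1]
```

In this example both 2-bit chunks are evicted before the 4-bit and 8-bit ones, even though chunk 3 was used least recently, which is the difference from a plain LRU queue.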

LLM as a System Service on Mobile Devices

ELMS: Elasticized Large Language Models On Mobile Devices

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Efficient Deployment of Large Language Model Across Cloud-Device Systems

LLMCad: Fast and Scalable On-device Large Language Model Inference

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions

MELTing point: Mobile Evaluation of Language Transformers

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

WiP: Efficient LLM Prefilling with Mobile NPU

LLMs as On-demand Customizable Service

Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management

Rethinking Mobile AI Ecosystem in the LLM Era

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

New Solutions on LLM Acceleration, Optimization, and Application

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Enabling On-Device LLMs Personalization with Smartphone Sensing

Herding LLaMaS: Using LLMs as an OS Module