LLM as a System Service on Mobile Devices

Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu
2024-03-18
Abstract: Being more powerful and intrusive into user-device interactions, LLMs are eager for on-device execution to better preserve user privacy. In this work, we propose a new paradigm of mobile AI: LLM as a system service on mobile devices (LLMaaS). Unlike traditional DNNs that execute in a stateless manner, such a system service is stateful: LLM execution often needs to maintain persistent states (mainly the KV cache) across multiple invocations. To minimize the LLM context switching overhead under a tight device memory budget, this work presents LLMS, which decouples the memory management of app and LLM contexts with a key idea of fine-grained, chunk-wise, globally-optimized KV cache compression and swapping. By fully leveraging the KV cache's unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: it compresses chunks based on their measured accuracy tolerance to compression. (2) IO-Recompute Pipelined Loading: it introduces recompute to swapping-in for acceleration. (3) Chunk Lifecycle Management: it optimizes the memory activities of chunks with an ahead-of-time swapping-out and an LCTRU (Least Compression-Tolerable and Recently-Used) queue-based eviction. In evaluations conducted on well-established traces and various edge devices, LLMS reduces context switching latency by up to two orders of magnitude when compared to competitive baseline solutions.
Operating Systems
What problem does this paper attempt to address?
The paper addresses the main challenge of running large language models (LLMs) as a system service on mobile devices: managing LLM contexts (mainly the KV cache) efficiently. Specifically:

1. **Proposing the LLMaaS paradigm (LLM as a system service)**: The paper proposes exposing the LLM and its inference infrastructure to applications as a service of the mobile operating system. This differs from the traditional approach, in which each application ships its own model and the operating system is unaware of it. The shared service enables more intelligent, personalized assistant functions, and on-device execution better protects user privacy.
2. **Addressing LLM context management**: Unlike traditional stateless DNN execution, an LLM must maintain persistent state (mainly the KV cache) across multiple invocations, which places new demands on memory management. The paper identifies this system challenge and designs a system called LLMS that minimizes the overhead of LLM context switching under a limited memory budget.
3. **Proposing three key techniques**:
   - **Tolerance-Aware Compression**: Each KV cache chunk is compressed to a different degree according to its measured tolerance to compression (its information density), so that the retained information is maximized while a global average compression ratio is met.
   - **IO-Recompute Pipelined Loading**: Swapping-in is accelerated by recomputing some chunks from the original text while the remaining chunks are loaded from storage, so that recomputation overlaps with I/O time.
   - **Chunk Lifecycle Management**: Chunks are swapped out ahead of time, and an LCTRU (Least Compression-Tolerable and Recently-Used) queue determines which chunks to evict and when.

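The idea behind tolerance-aware compression can be illustrated with a small sketch. It is not the authors' algorithm: the function name `assign_bit_widths`, the per-chunk `density` scores, and the bit-width levels (2, 4, 8) are all assumptions made for illustration. The sketch greedily gives more bits to the chunks assumed to tolerate compression least, while keeping the mean bit width within a global budget.

```python
def assign_bit_widths(density, avg_bits, levels=(2, 4, 8)):
    """Hypothetical greedy allocation of quantization bit widths to KV cache chunks.

    density  : per-chunk scores; higher is assumed to mean less compression-tolerant
    avg_bits : global budget, i.e. the maximum mean bit width over all chunks
    levels   : allowed bit widths in ascending order (illustrative values)
    """
    n = len(density)
    bits = [levels[0]] * n                 # start every chunk at the lowest precision
    budget = avg_bits * n - sum(bits)      # spare bits left to hand out
    if budget < 0:
        raise ValueError("avg_bits is below the lowest available level")

    # Visit chunks from most to least information-dense.
    order = sorted(range(n), key=lambda i: density[i], reverse=True)
    for i in order:
        # Try the highest precision first and fall back to lower ones.
        for level in reversed(levels):
            extra = level - bits[i]
            if 0 < extra <= budget:
                bits[i] = level
                budget -= extra
                break
    return bits


if __name__ == "__main__":
    widths = assign_bit_widths([0.9, 0.1, 0.5, 0.3], avg_bits=4)
    print(widths, sum(widths) / len(widths))
```

In this toy setting the densest chunk keeps the highest precision and the least dense chunks stay at the lowest, while the mean bit width does not exceed the budget. A real system would derive the scores from the model itself rather than take them as input.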
With these techniques, LLMS reduces context switching latency by up to two orders of magnitude compared with competitive baselines in the authors' evaluations on edge devices, while maintaining high accuracy, which makes LLM services on mobile devices more practical.
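To see why overlapping recomputation with I/O shortens a context switch, consider a simplified model. It is an assumption for illustration, not the paper's scheduler: every chunk is taken to cost the same fixed time to recompute (`t_recompute`) or to load from storage (`t_io`), and the two paths run fully in parallel. The hypothetical helper `split_chunks` picks how many chunks to recompute so that the slower of the two paths finishes as early as possible.

```python
def split_chunks(n_chunks, t_recompute, t_io):
    """Return (k, latency) for a simplified recompute/I-O overlap model.

    k chunks are recomputed while the remaining n_chunks - k are loaded
    from storage in parallel; latency is the time of the slower path.
    All costs are per-chunk constants, which is an illustrative assumption.
    """
    def latency(k):
        return max(k * t_recompute, (n_chunks - k) * t_io)

    best_k = min(range(n_chunks + 1), key=latency)
    return best_k, latency(best_k)


if __name__ == "__main__":
    k, overlapped = split_chunks(10, t_recompute=3, t_io=2)
    io_only = 10 * 2
    print(f"recompute {k} chunks: {overlapped} time units vs {io_only} with I/O alone")
```

With these illustrative numbers, recomputing 4 of the 10 chunks brings the switch-in time from 20 units (I/O alone) down to 12. In practice the per-chunk costs vary with chunk size, compression level, and device, so the split would have to be chosen from measured costs.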