WiP: Efficient LLM Prefilling with Mobile NPU

Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, Xuanzhe Liu
DOI: https://doi.org/10.1145/3662006.3662066
2024-01-01
Abstract: Large language models (LLMs) play a crucial role in various Natural Language Processing (NLP) tasks, prompting their deployment on mobile devices for inference. However, a significant challenge is high waiting latency, especially for long prompts. This paper introduces mllm-NPU, the first system enabling efficient on-device LLM prefilling acceleration using on-chip Neural Processing Units (NPUs). Despite the impressive compute capabilities of NPUs, applying them directly to LLM prefilling often falls short. To this end, mllm-NPU incorporates two key techniques: (1) chunk-wise CPU-NPU co-scheduling, which addresses the problems of static compute graphs and INT8-only acceleration; and (2) dynamic outlier inference, which addresses the accuracy loss caused by static activation quantization.
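The abstract names the two techniques without describing their mechanics. The sketch below illustrates the general ideas only and is not mllm-NPU's implementation. It assumes that "chunk-wise" means splitting a variable-length prompt into fixed-length, padded chunks so that a static compute graph can be reused, and that outliers are activations falling outside the range a static INT8 scale can represent, kept in floating point instead of being clipped. All names (`split_into_chunks`, `quantize_with_outliers`, `CHUNK_LEN`, `PAD_ID`) are hypothetical.

```python
from typing import Dict, List, Tuple

CHUNK_LEN = 4  # hypothetical fixed chunk length expected by a static graph
PAD_ID = 0     # hypothetical padding token id


def split_into_chunks(
    token_ids: List[int], chunk_len: int = CHUNK_LEN, pad_id: int = PAD_ID
) -> List[Tuple[List[int], int]]:
    """Split a prompt into fixed-length chunks, padding the last one.

    Returns (chunk, valid_length) pairs so the padded positions can be ignored.
    """
    chunks = []
    for start in range(0, len(token_ids), chunk_len):
        chunk = token_ids[start:start + chunk_len]
        valid = len(chunk)
        chunk = chunk + [pad_id] * (chunk_len - valid)
        chunks.append((chunk, valid))
    return chunks


def quantize_with_outliers(
    values: List[float], scale: float, qmax: int = 127
) -> Tuple[List[int], Dict[int, float]]:
    """Quantize with a fixed (static) scale; keep out-of-range values in float.

    Assumption: an outlier is any value whose magnitude exceeds scale * qmax.
    """
    limit = scale * qmax
    quantized: List[int] = []
    outliers: Dict[int, float] = {}
    for i, v in enumerate(values):
        if abs(v) > limit:
            outliers[i] = v      # kept in float, handled separately
            quantized.append(0)  # placeholder in the INT8 tensor
        else:
            quantized.append(round(v / scale))
    return quantized, outliers


def dequantize(
    quantized: List[int], outliers: Dict[int, float], scale: float
) -> List[float]:
    """Rebuild the values, restoring outliers from their float copies."""
    return [outliers.get(i, q * scale) for i, q in enumerate(quantized)]


if __name__ == "__main__":
    print(split_into_chunks([1, 2, 3, 4, 5, 6]))
    q, o = quantize_with_outliers([0.5, -1.0, 50.0], scale=0.01)
    print(q, o, dequantize(q, o, scale=0.01))
```

In this toy setup, a static scale of 0.01 covers roughly [-1.27, 1.27], so the value 50.0 is routed to the float path while the remaining values stay in INT8.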