Poster: PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing

Ruilong Ma, Jingyu Wang, Qi, Xiang Yang, Haifeng Sun, Zirui Zhuang, Jianxin Liao
DOI: https://doi.org/10.1145/3603269.3610856
2023-01-01
Abstract: Large Language Models (LLMs) have given rise to new application requirements. Locally deployed LLMs for micro-enterprises mitigate potential issues such as privacy infringement and sluggish responses. However, they are hampered by the limited computing capability and memory of the devices such enterprises possess. We introduce PipeLLM, which allocates the model across devices in proportion to their computing capabilities. It enables parallel execution of layers by slicing the input sequence along the token dimension. PipeLLM demonstrates the potential to accelerate LLM inference on heterogeneous devices, offering a solution for LLM deployment in micro-enterprise hardware environments.
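The two ideas in the abstract, capability-proportional layer allocation and slicing the input along the token dimension, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`allocate_layers`, `slice_sequence`, `pipeline_schedule`), the largest-remainder rounding rule, the equal-length slices, and the one-slice-per-time-step schedule are all assumptions made for illustration.

```python
def allocate_layers(num_layers, capabilities):
    """Split num_layers across devices in proportion to their capability.

    Assumption: rounding uses the largest-remainder rule; the poster does
    not specify how fractional shares are resolved.
    """
    total = sum(capabilities)
    shares = [num_layers * c / total for c in capabilities]
    counts = [int(s) for s in shares]
    # Give leftover layers to the devices with the largest fractional parts.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def slice_sequence(tokens, num_slices):
    """Split tokens into num_slices contiguous chunks along the token axis."""
    base, extra = divmod(len(tokens), num_slices)
    slices, start = [], 0
    for k in range(num_slices):
        end = start + base + (1 if k < extra else 0)
        slices.append(tokens[start:end])
        start = end
    return slices


def pipeline_schedule(num_stages, num_slices):
    """List, per time step, the (stage, slice) pairs that run concurrently.

    Slice k reaches stage s at step s + k, so each stage sees the slices
    in order while different stages work on different slices in parallel.
    """
    steps = []
    for t in range(num_stages + num_slices - 1):
        steps.append([(s, t - s) for s in range(num_stages)
                      if 0 <= t - s < num_slices])
    return steps


if __name__ == "__main__":
    # Hypothetical setup: 32 layers, three devices with relative speeds 4:2:2.
    print(allocate_layers(32, [4, 2, 2]))
    print(slice_sequence(list(range(10)), 3))
    for step in pipeline_schedule(num_stages=2, num_slices=3):
        print(step)
```

In this sketch, each stage processes slices in their original order, which is what a causal model would need if later slices attend to the cached keys and values of earlier ones. Without slicing, a single sequence would occupy only one stage at a time, leaving the other devices idle.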