CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

Jinrong Li, Biao Han, Sudan Li, Xiaoyan Wang, Jie Li
DOI: https://doi.org/10.1109/iccc62479.2024.10681712
2024-01-01
Abstract: Large Language Models (LLMs) have advanced notably over the past few years and are now used extensively across various domains. Running large-scale LLMs usually requires data-center-level processing capacity, which puts many potential applications out of researchers' reach. However, some promising LLM applications, such as Internet of Things (IoT) data analysis and multi-robot collaboration, typically run on devices that lack resources, specifically graphics processing units (GPUs). As a result, these devices fail to execute LLM inference. To tackle these issues, we first investigate the “Compute Bound” problem in resource-constrained devices, to which hierarchical model partitioning is not applicable. Furthermore, building on LLM tensor parallelism, we present CoLLM, a collaborative LLM inference framework for resource-constrained devices. In addition, we propose a minimal latency algorithm and an adaptive load balancing algorithm to optimise inference latency and balance energy consumption. (1) By considering the model size, device resources, and network conditions, we calculate the optimal number of collaborating devices that minimises inference latency. (2) CoLLM dynamically distributes computational workloads according to each target device's status, balancing power consumption to extend overall working time. We conduct experiments running the Llama2 model on GPU-free devices such as Raspberry Pis. Evaluation results show that end-to-end inference is $1.9\times$ to $2.3\times$ faster than current hierarchical LLM inference methods.
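The two ideas in points (1) and (2) can be illustrated with a small sketch. The abstract does not give the paper's cost model or algorithms, so everything below is an assumption made for illustration: the latency formula (compute time shrinking with the device count, communication time growing with it), the weighting of devices by throughput times remaining battery, and all names (`Device`, `estimate_latency`, `best_device_count`, `split_workload`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    flops: float    # sustained throughput in FLOP/s (assumed to be known)
    battery: float  # remaining energy as a fraction in [0, 1]


def estimate_latency(n, total_flops, per_device_flops, sync_bytes, bandwidth, rtt):
    """Toy per-token latency for n devices sharing one model via tensor parallelism.

    Assumed model (not from the paper): compute is split evenly across devices,
    and each extra device adds one synchronisation transfer plus one round trip.
    """
    compute = total_flops / (n * per_device_flops)
    communication = (n - 1) * (sync_bytes / bandwidth + rtt)
    return compute + communication


def best_device_count(max_devices, **cost_params):
    """Return the device count in 1..max_devices with the lowest estimated latency."""
    return min(
        range(1, max_devices + 1),
        key=lambda n: estimate_latency(n, **cost_params),
    )


def split_workload(devices, total_units):
    """Split integer work units (e.g. attention heads) across devices.

    Each device is weighted by flops * battery, so a fast device with a nearly
    empty battery receives less work than its raw speed alone would suggest.
    """
    weights = [d.flops * d.battery for d in devices]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("no device has usable capacity")
    shares = [int(total_units * w / total_weight) for w in weights]
    # Rounding down can leave a few units unassigned; give them to the
    # devices with the largest weights first.
    leftover = total_units - sum(shares)
    by_weight = sorted(range(len(devices)), key=lambda i: weights[i], reverse=True)
    for i in by_weight[:leftover]:
        shares[i] += 1
    return {d.name: s for d, s in zip(devices, shares)}
```

Under this toy model, adding devices helps only while the saved compute time exceeds the added communication time, which is why an optimal device count exists at all. A real system would presumably re-run the split as battery levels and link quality change.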