Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao
DOI: https://doi.org/10.1145/3662006.3662067
2024-01-01
Abstract: Edge-Cloud collaboration for deep learning inference has been actively studied as a way to enhance inference performance by leveraging both Edge and Cloud resources. However, traditional Edge-Cloud collaboration based on model partitioning or confidence scores is not suitable in the era of LLMs (large language models), because of their autoregressive generation and their generality across diverse tasks. This paper proposes a dynamic token-level Edge-Cloud collaboration for LLMs. An SLM (small language model) such as TinyLlama resides on the Edge devices and, through token-level interaction with the Cloud-side LLMs during inference, approaches LLM quality at a controllable cost similar to that of an SLM. Evaluation results show that our method uses only 25.8% of the LLM cost to achieve LLM-comparable quality on the GSM8K task.
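The token-level collaboration described in the abstract can be pictured with a minimal sketch. The abstract does not say how the method decides which tokens involve the Cloud-side LLM, so the routing rule here is a pluggable, hypothetical `should_escalate` callback, and the toy example uses a simple score threshold purely for illustration. The `slm` and `llm` callables, the toy models, and the "fraction of tokens produced by the LLM" cost proxy are all assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a token prefix to (next_token, score).
NextToken = Callable[[List[str]], Tuple[str, float]]
# Hypothetical routing rule: (prefix, proposed_token, score) -> escalate?
Router = Callable[[List[str], str, float], bool]


def collaborative_generate(
    slm: NextToken,
    llm: NextToken,
    should_escalate: Router,
    prompt: List[str],
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> Tuple[List[str], float]:
    """Generate token by token on the edge, escalating selected tokens to the cloud.

    Returns the generated tokens and the fraction of them produced by the LLM,
    used here as a rough stand-in for "LLM cost".
    """
    tokens = list(prompt)
    llm_calls = 0
    generated = 0
    for _ in range(max_new_tokens):
        token, score = slm(tokens)              # edge-side proposal
        if should_escalate(tokens, token, score):
            token, _ = llm(tokens)              # cloud-side replacement
            llm_calls += 1
        tokens.append(token)
        generated += 1
        if token == eos:
            break
    ratio = llm_calls / generated if generated else 0.0
    return tokens[len(prompt):], ratio


# Toy stand-ins for the two models (assumptions for illustration only).
TARGET = ["4", "+", "4", "=", "8", "<eos>"]


def toy_slm(prefix: List[str]) -> Tuple[str, float]:
    i = len(prefix) - 1                         # prompt has length 1
    if i == 4:
        return "9", 0.3                         # the small model is unsure and wrong here
    return TARGET[i], 0.9


def toy_llm(prefix: List[str]) -> Tuple[str, float]:
    return TARGET[len(prefix) - 1], 1.0


out, ratio = collaborative_generate(
    toy_slm, toy_llm,
    should_escalate=lambda prefix, token, score: score < 0.5,
    prompt=["Q:"],
)
print(out, round(ratio, 3))
```

In this toy run only one of six generated tokens is routed to the LLM, which illustrates how a token-level scheme could keep cloud usage to a small share of the output while still correcting the token the small model gets wrong.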