Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things

Xiaoming Yuan, Weixuan Kong, Zhenyu Luo, Minrui Xu
DOI: https://doi.org/10.3390/electronics13112077
IF: 2.9
2024-05-28
Electronics
Abstract: Despite recent significant advancements in large language models (LLMs) for medical services, the deployment difficulties of LLMs in e-healthcare hinder complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs face difficulties in providing accurate medical questions and answers (Q&As) and meeting the deployment resource demands in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, enabling deployment on the IoMT, improving the privacy protection for users. Additionally, we find that the significant factors affecting latency include the method of device interconnection, the location of offloading servers, and the speed of the disk.
engineering, electrical & electronic; physics, applied; computer science, information systems
What problem does this paper attempt to address?
The paper addresses the difficulty of deploying large language models (LLMs) on Internet of Medical Things (IoMT) devices, in particular providing accurate medical question-and-answer (Q&A) services within the devices' resource limits. Specifically, existing LLMs face the following challenges when deployed on IoMT devices:

1. **Resource limitations**: Because of their large parameter counts, existing LLMs require substantial computing resources, which makes them difficult to deploy directly on resource-limited IoMT devices.
2. **Privacy protection**: Deploying LLMs on the server side brings risks of data leakage and unauthorized access, whereas deploying directly on IoMT devices can better protect user privacy.
3. **Accuracy**: General-purpose LLMs may not be accurate enough on medical questions because they lack a deep understanding of medical terminology and context.

To address these challenges, the paper proposes MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture, and introduces an inference offloading strategy that reduces memory usage so that the model can be deployed on IoMT devices. The paper also analyzes the key factors affecting inference latency, namely the device interconnection method, the location of the offloading server, and the disk speed, and proposes strategies to reduce latency. Through these methods, the paper aims to improve LLM performance on medical Q&A tasks while protecting user privacy.
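To make the offloading idea more concrete, here is a minimal Python sketch of one common way to offload MoE experts: keep only a few experts resident in fast device memory and fetch the rest on demand from slower storage or a remote server. This is an illustration under stated assumptions, not the paper's actual strategy. The names `ExpertCache`, `load_fn`, and `estimate_load_latency`, the least-recently-used eviction policy, and the simple latency formula are all hypothetical.

```python
from collections import OrderedDict


class ExpertCache:
    """Hypothetical LRU cache holding at most `capacity` experts in fast memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # assumed loader: expert_id -> weights
        self.resident = OrderedDict()  # expert_id -> weights, least recent first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        """Return the weights for an expert, loading them on a cache miss."""
        if expert_id in self.resident:
            # Mark this expert as the most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Miss: fetch from disk or an offloading server (the slow path).
        self.misses += 1
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            # Evict the least recently used expert to stay within memory.
            self.resident.popitem(last=False)
        return weights


def estimate_load_latency(misses, expert_bytes, bandwidth_bytes_per_s,
                          round_trip_s=0.0):
    """Rough latency model (an assumption): each miss pays one round trip
    to the offloading location plus the transfer time of one expert."""
    return misses * (round_trip_s + expert_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Stand-in loader; a real system would read expert weights from storage.
    cache = ExpertCache(capacity=2, load_fn=lambda eid: f"weights-{eid}")
    for eid in [0, 1, 0, 2, 1]:  # expert ids chosen by a router, made up here
        cache.get(eid)
    print("hits:", cache.hits, "misses:", cache.misses)
    print("estimated load latency (s):",
          estimate_load_latency(cache.misses, expert_bytes=100,
                                bandwidth_bytes_per_s=50, round_trip_s=0.5))
```

In this toy model, `bandwidth_bytes_per_s` stands in for disk speed or link throughput, and `round_trip_s` stands in for the interconnection method and the server's location. These correspond to the three latency factors named in the summary.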