Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things

Xiaoming Yuan, Weixuan Kong, Zhenyu Luo, Minrui Xu

DOI: https://doi.org/10.3390/electronics13112077

IF: 2.9

2024-05-28

Electronics

Abstract: Despite recent significant advancements in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). People are increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle both to provide accurate medical question answering (Q&A) and to meet the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, enabling deployment on the IoMT and improving privacy protection for users. Additionally, we find that the significant factors affecting latency include the method of device interconnection, the location of offloading servers, and the speed of the disk.

engineering, electrical & electronic; physics, applied; computer science, information systems

What problem does this paper attempt to address?

The paper addresses the difficulty of deploying large language models (LLMs) on Internet of Medical Things (IoMT) devices, in particular providing accurate medical question-and-answer (Q&A) services within the resource limits of those devices. Specifically, existing LLMs face the following challenges when deployed on IoMT devices:

1. **Resource limitations**: Because of their large parameter counts, existing LLMs require substantial computing resources, which makes them difficult to deploy directly on resource-limited IoMT devices.
2. **Privacy protection**: Deploying LLMs on the server side brings risks of data leakage and unauthorized access, while direct deployment on IoMT devices can better protect user privacy.
3. **Accuracy**: Existing general-purpose LLMs may not be accurate enough on medical questions because they lack a deep understanding of medical terms and context.

To address these challenges, the paper proposes MedMixtral 8x7B, a new medical LLM based on the Mixture-of-Experts (MoE) architecture, and introduces an efficient inference offloading strategy that reduces memory usage so that the model can be deployed on IoMT devices. The paper also analyzes the key factors affecting inference latency, namely the inter-device connection method, the location of the offloading server, and the disk speed, and proposes strategies to reduce latency. Through these methods, the paper aims to improve the performance of LLMs on medical Q&A tasks while protecting user privacy.
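The summary does not spell out how the offloading strategy works, so the following is only a minimal sketch of one common way to offload MoE experts: keep a small number of experts in fast memory and fetch the rest on demand, evicting the least recently used one. The class `ExpertCache`, the loader `load_fn`, and the helper `estimated_load_latency_s` are hypothetical names introduced for illustration, and LRU eviction is an assumption rather than the paper's confirmed policy. The latency helper is a rough model in which each cache miss costs one round trip plus a transfer over the slowest link (disk or network), which is how connection method, server location, and disk speed could enter the picture.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """LRU cache that keeps at most `capacity` experts in fast memory (illustrative)."""

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical loader: fetches one expert's weights from disk or a remote server.
        self.load_fn = load_fn
        self._cache: "OrderedDict[Hashable, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        """Return the expert's weights, loading them and evicting the LRU entry on a miss."""
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # drop the least recently used expert
        return weights


def estimated_load_latency_s(
    misses: int,
    expert_bytes: float,
    bandwidth_bytes_per_s: float,
    round_trip_s: float = 0.0,
) -> float:
    """Rough model (an assumption): each miss costs one round trip plus one transfer."""
    return misses * (round_trip_s + expert_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Stand-in loader; a real one would read weight tensors from storage.
    cache = ExpertCache(capacity=2, load_fn=lambda eid: f"weights-{eid}")
    for eid in [0, 1, 0, 2, 1]:  # hypothetical sequence of router choices
        cache.get(eid)
    print(cache.hits, cache.misses)
```

In this sketch, a larger `capacity` trades memory for fewer misses, and a faster link or a nearer offloading server lowers the cost of each miss.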

Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

WDMoE: Wireless Distributed Mixture of Experts for Large Language Models

WDMoE: Wireless Distributed Large Language Models with Mixture of Experts

AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

MING-MOE: Enhancing Medical Multi-Task Learning in Large Language Models with Sparse Mixture of Low-Rank Adapter Experts

AI Hospital: Interactive Evaluation and Collaboration of LLMs As Intern Doctors for Clinical Diagnosis

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

MoE-Infinity: Offloading-Efficient MoE Model Serving

JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

MedAide: Leveraging Large Language Models for On-Premise Medical Assistance on Edge Devices

MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation

Guiding IoT-Based Healthcare Alert Systems with Large Language Models

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering

Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings