Abstract: LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses significant memory challenges: loading all experts onto devices is impractical, and frequently switching between experts in response to user requests incurs substantial I/O costs. Previous approaches decompose the expert weights into the pre-trained weights plus delta weights, then quantize the delta weights using output channel-wise step sizes to reduce the model size. However, these methods overlook the fact that certain input channels of the delta weights can cause significant quantization errors at extremely low bitwidths. Additionally, existing methods assume that the appropriate model for a user request is known in advance, which is not the case in practice. To this end, we introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs. To reduce the number of bits needed to represent the delta weights, we propose a salient-aware delta compression method that identifies salient input channels based on reconstruction error and applies mixed-precision quantization: non-salient channels are reduced to low bits while salient ones are kept intact, cutting storage demand without compromising performance. Moreover, we develop a model-level routing method that efficiently directs user queries to the most suitable expert by performing domain classification. Extensive experiments demonstrate the memory efficiency and routing performance of ME-Switch. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the model size by $1.74\times$ and maintains nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Notably, our method can efficiently serve 16 Mistral-7B models on a single NVIDIA A100 GPU.
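
The salient-aware delta compression described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the abstract: weights are stored as `(out_features, in_features)`, the reconstruction error is measured on the weights alone (the actual criterion may differ, for example by involving activations), a simple min-max uniform quantizer is used, and the helper names `quantize_rows` and `compress_delta` as well as the parameters `bits` and `num_salient` are hypothetical.

```python
import numpy as np


def quantize_rows(w, bits):
    """Min-max uniform quantization with one step size per output channel (row).

    Returns the dequantized values so the error can be measured directly.
    """
    levels = 2 ** bits - 1
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - lo) / scale), 0, levels)
    return q * scale + lo


def compress_delta(w_expert, w_base, bits=2, num_salient=2):
    """Keep the highest-error input channels intact; quantize the rest to low bits."""
    delta = w_expert - w_base  # expert = pre-trained weights + delta weights

    # Assumption: score each input channel (column) by the squared
    # weight reconstruction error it incurs under low-bit quantization.
    err = ((delta - quantize_rows(delta, bits)) ** 2).sum(axis=0)
    salient = np.sort(np.argsort(err)[::-1][:num_salient])

    mask = np.zeros(delta.shape[1], dtype=bool)
    mask[salient] = True

    approx = np.empty_like(delta)
    approx[:, mask] = delta[:, mask]  # salient channels stay in full precision
    if (~mask).any():
        # Non-salient channels get low-bit codes with per-row step sizes.
        approx[:, ~mask] = quantize_rows(delta[:, ~mask], bits)
    return approx, salient
```

Under these assumptions, `w_base + approx` approximates the expert weights, and only the low-bit codes, the per-row step sizes and offsets, and the few salient columns would need to be stored per expert.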

PUZZLE: Efficiently Aligning Large Language Models Through Light-Weight Context Switch

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

Mixture of In-Context Experts Enhance LLMs' Long Context Awareness

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

PURE: Aligning LLM Via Pluggable Query Reformulation for Enhanced Helpfulness

ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

Evolving Alignment via Asymmetric Self-Play

Unleashing the Creative Mind: Language Model As Hierarchical Policy For Improved Exploration on Challenging Problem Solving

Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Orchestrating LLMs with Different Personalizations

ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation

TWOSOME: an Efficient Online Framework to Align LLMs with Embodied Environments Via Reinforcement Learning

Decoupled Alignment for Robust Plug-and-Play Adaptation

Secrets of RLHF in Large Language Models Part I: PPO

Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy