Abstract: Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE uses a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss the inherent complexities associated with it. To address these complexities, we propose a two-step routing approach that first uses a router to classify the input into distinct categories and then applies a category-to-expert mapping to obtain the desired experts. CoE offers a flexible and cost-effective solution for building compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Because CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs obtained using the open-weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of $59.4$ with merely $31$ billion average active parameters on Arena-Hard and a score of $9.06$ with $54$ billion average active parameters on MT-Bench.
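
The two-step routing described in the abstract (input to category, then category to expert) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword-based `classify` function merely stands in for a trained category router, and the category names, the `CATEGORY_TO_EXPERT` table, and the parameter counts (rounded from the model names, in billions) are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical expert pool: model name -> parameter count in billions.
# Values are rounded from the model names and are illustrative only.
EXPERT_PARAMS_B = {
    "Qwen/Qwen2-7B-Instruct": 7,
    "google/gemma-2-9b-it": 9,
    "meta-llama/Llama-3.1-70B-Instruct": 70,
}

# Hypothetical category-to-expert mapping (step 2 of the routing).
CATEGORY_TO_EXPERT = {
    "chat": "Qwen/Qwen2-7B-Instruct",
    "math": "google/gemma-2-9b-it",
    "code": "meta-llama/Llama-3.1-70B-Instruct",
}


def classify(prompt: str) -> str:
    """Stand-in for the category router (step 1): a toy keyword classifier."""
    text = prompt.lower()
    if any(word in text for word in ("python", "function", "bug")):
        return "code"
    if any(word in text for word in ("integral", "solve", "prove")):
        return "math"
    return "chat"


def route(prompt: str) -> str:
    """Two-step routing: prompt -> category -> expert name."""
    return CATEGORY_TO_EXPERT[classify(prompt)]


def average_active_params(prompts: list[str]) -> float:
    """Mean parameter count (billions) of the experts selected for the prompts."""
    counts = Counter(route(p) for p in prompts)
    total = sum(EXPERT_PARAMS_B[name] * n for name, n in counts.items())
    return total / len(prompts)


if __name__ == "__main__":
    demo = ["Fix the bug in this Python function", "Solve x^2 = 4", "Tell me a joke"]
    for p in demo:
        print(p, "->", route(p))
    print("average active parameters (B):", average_active_params(demo))
```

Because only one expert runs per prompt, the average active parameter count depends on how often each category occurs, which is why the abstract reports different averages for Arena-Hard and MT-Bench.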

What problem does this paper attempt to address?

This paper addresses the challenges that large language models (LLMs) face in scalability, cost, and customization. Although large monolithic LLMs have made remarkable progress in performance, their single-model structure causes the following problems:

1. **Scalability**: As the number of model parameters increases, the cost of training and inference rises sharply.
2. **Cost**: Large-scale LLMs require expensive computational resources, making maintenance and updates costly.
3. **Customization**: Monolithic LLMs are difficult to customize efficiently for specific tasks or domains, because fine-tuning them requires substantial computational resources and is prone to catastrophic forgetting.

To solve these problems, the authors propose a modular compound AI system named "Composition of Experts (CoE)". CoE integrates multiple expert LLMs and uses a router to dynamically select the expert best suited to a given input, which improves both resource utilization and performance.

### Main contributions

1. **Proposing the CoE framework**: A compound AI system is constructed from multiple expert LLMs and a router, where the router selects the best expert model for each input.
2. **Two-step routing method**: The input is first classified into one of a limited number of categories, and an expert model is then selected based on that category. This gives CoE its modular character and makes it easy to extend and maintain.
3. **Enhanced robustness**: Robust-CoE is introduced to adapt to new input distributions by quantifying uncertainty, improving the robustness of the system.
4. **Optimized training method**: Training proceeds in two steps, first the category router and then the category-to-expert mapping; the latter is formulated as a solvable mixed-integer linear programming (MILP) problem.
5. **Efficient implementation**: The three-tiered memory architecture of the SambaNova SN40L is used to deploy CoE efficiently and reduce inference cost.
6. **Empirical evaluation**: Experiments verify the effectiveness of CoE on Arena-Hard and MT-Bench, where it reaches a score of $59.4$ with $31$ billion average active parameters and a score of $9.06$ with $54$ billion average active parameters, respectively.

### Key technologies of the solution

- **Router design**: The router is the core component of CoE and selects the most appropriate expert model for each input. To keep this selection accurate, the authors propose a two-step routing method that first classifies the input and then selects the expert.
- **Parameter budget constraint**: Under a given parameter budget, the selection of expert models is optimized to maximize performance while controlling cost.
- **Data labeling and training**: A semi-supervised learning method is used to generate high-quality training data for the router, so that it can accurately distinguish the capabilities of different experts.

Together, these techniques let CoE improve performance while reducing the demand for computational resources, making the deployment of large language models more flexible and economical.
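
The category-to-expert mapping under a parameter budget can be illustrated with a small search. The summary states that the authors cast this step as a MILP; the sketch below deliberately swaps in exhaustive enumeration, which is only practical for a handful of categories and experts, and it assumes the budget is a cap on expected active parameters. The function name `best_mapping`, the score table, the category frequencies, and the parameter counts are all hypothetical and exist only for the example.

```python
from itertools import product


def best_mapping(scores, params_b, freq, budget_b):
    """Exhaustively search category-to-expert mappings under a budget.

    scores[c][e]: hypothetical quality of expert e on category c.
    params_b[e]:  parameter count of expert e, in billions.
    freq[c]:      fraction of prompts that fall in category c (sums to 1).
    budget_b:     cap on the expected active parameters, in billions.

    Returns (mapping, quality), or (None, -inf) if no mapping fits the budget.
    """
    categories = list(scores)
    experts = list(params_b)
    best, best_quality = None, float("-inf")
    # Try every way of assigning one expert to each category.
    for choice in product(experts, repeat=len(categories)):
        avg_params = sum(freq[c] * params_b[e] for c, e in zip(categories, choice))
        if avg_params > budget_b:
            continue  # violates the parameter budget
        quality = sum(freq[c] * scores[c][e] for c, e in zip(categories, choice))
        if quality > best_quality:
            best, best_quality = dict(zip(categories, choice)), quality
    return best, best_quality


if __name__ == "__main__":
    # Illustrative numbers only: two categories, two experts.
    scores = {
        "code": {"small": 0.5, "large": 0.9},
        "chat": {"small": 0.8, "large": 0.85},
    }
    params_b = {"small": 7, "large": 70}
    freq = {"code": 0.5, "chat": 0.5}
    mapping, quality = best_mapping(scores, params_b, freq, budget_b=40)
    print(mapping, quality)
```

In this toy setting the budget rules out sending both categories to the large expert, so the search sends only the category where the large expert helps most to it. A MILP solver plays the same role at realistic scale, where enumeration grows exponentially with the number of categories.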

Composition of Experts: A Modular Compound AI System Leveraging Large Language Models
