Toward Inference-optimal Mixture-of-Expert Large Language Models

Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P. Xing, Hao Zhang
2024-04-04
Abstract: Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe diminishing returns from increasing the number of experts. Taken alone, this suggests scaling the number of experts until saturation, since the training cost would remain constant, but doing so is problematic at inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution under the same performance, but they cost 2.5-3.5x more to train. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset, is a promising setup under a training budget.
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to explore how to optimize the number of experts in Mixture of Experts (MoE) large language models (LLMs) to balance model performance, training cost, and inference efficiency. Specifically, the paper investigates the following issues:

1. **Relationship between model performance and scale**: How does model performance change with increasing model scale? Particularly, is the performance improvement still significant when increasing the number of experts?
2. **Relationship between training cost and model scale**: Given a training budget, how to optimally allocate model scale and training data volume?
3. **Impact on inference efficiency**: What is the impact of increasing the number of experts on the model's inference efficiency? How to reduce inference costs while maintaining high performance?

### Main Contributions

1. **Extending existing scaling laws**: Extending the existing dense model scaling laws to MoE models by introducing the number of experts as a new variable, revealing the relationship between validation loss, model scale, training data volume, and the number of experts.
2. **Introducing inference cost as a key metric**: Introducing inference cost as an important metric for evaluating model performance, in addition to traditional validation loss, and proposing a new budget allocation method that comprehensively considers model quality and practical resource constraints.
3. **Proposing overtraining configuration**: The study shows that training smaller models on larger datasets can significantly reduce inference costs while maintaining high performance. This overtraining configuration is more efficient in practical applications.

### Research Background

- **Mixture of Experts (MoE) models**: MoE models use a routing mechanism to assign input tokens to different experts for processing, allowing the model's parameter scale to expand without significantly increasing computational costs.
- **Scaling laws**: Existing research has established scaling laws for dense models, describing the relationship between model scale, training data volume, and validation loss. However, these studies mostly ignore the impact of the number of experts on model performance.
- **Inference efficiency**: In practical applications, inference efficiency is an important consideration. During inference, the model needs to store a large number of intermediate states (such as KV cache), which can occupy a lot of memory and affect inference speed and cost.

### Experimental Setup and Results

- **Experimental setup**: Researchers trained a series of models with different scales and numbers of experts and tested them on datasets of varying sizes.
- **Results analysis**: The study shows that increasing the number of experts can indeed improve model performance, but the performance improvement diminishes beyond a certain threshold. Additionally, increasing the number of experts significantly increases inference costs. Therefore, the paper proposes a new budget allocation strategy that achieves a balance between performance and inference efficiency by training smaller models on larger datasets.

### Conclusion

By extending existing scaling laws and introducing inference cost as an evaluation metric, this paper proposes a more comprehensive budget allocation method. The study shows that in practical applications, training smaller models on larger datasets can significantly reduce inference costs while maintaining high performance. This finding provides important guidance for the design and optimization of MoE models.
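
To make the trade-off summarized above more concrete, here is a minimal Python sketch. It is not the paper's method or its fitted scaling law. It assumes a simplified MoE transformer in which only the feed-forward block is replicated per expert, ignores embeddings, router weights, and normalization parameters, and uses the common rule of thumb of roughly 6 FLOPs per active parameter per training token. The `MoEConfig` class, its fields, and all helper names are hypothetical, and the configuration values are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical, simplified description of an MoE transformer."""
    n_layers: int
    d_model: int
    d_ff: int
    n_experts: int   # experts per MoE layer
    top_k: int       # experts activated per token
    n_heads: int
    n_kv_heads: int
    head_dim: int


def ffn_params(cfg: MoEConfig) -> int:
    # Assumption: a two-matrix feed-forward block, biases ignored.
    return 2 * cfg.d_model * cfg.d_ff


def attn_params(cfg: MoEConfig) -> int:
    # Query and output projections use n_heads; key and value use n_kv_heads.
    q_and_o = 2 * cfg.d_model * cfg.n_heads * cfg.head_dim
    k_and_v = 2 * cfg.d_model * cfg.n_kv_heads * cfg.head_dim
    return q_and_o + k_and_v


def total_params(cfg: MoEConfig) -> int:
    # Every expert must be stored, so serving memory scales with n_experts.
    return cfg.n_layers * (attn_params(cfg) + cfg.n_experts * ffn_params(cfg))


def active_params(cfg: MoEConfig) -> int:
    # Only top_k experts run per token, so compute scales with top_k.
    return cfg.n_layers * (attn_params(cfg) + cfg.top_k * ffn_params(cfg))


def training_flops(cfg: MoEConfig, n_tokens: int) -> int:
    # Rule-of-thumb estimate: about 6 FLOPs per active parameter per token.
    return 6 * active_params(cfg) * n_tokens


def kv_cache_bytes(cfg: MoEConfig, seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    # Keys and values (factor 2) for every layer, KV head, and position.
    return (2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim
            * seq_len * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Illustrative configuration only; not a model from the paper.
    base = MoEConfig(n_layers=24, d_model=2048, d_ff=8192, n_experts=8,
                     top_k=2, n_heads=16, n_kv_heads=16, head_dim=128)
    wider = replace(base, n_experts=32)
    for name, cfg in (("8 experts", base), ("32 experts", wider)):
        print(name,
              "total params:", total_params(cfg),
              "active params:", active_params(cfg),
              "training FLOPs for 1e9 tokens:", training_flops(cfg, 10**9))
    print("KV cache bytes (seq 4096, batch 8):",
          kv_cache_bytes(base, seq_len=4096, batch_size=8))
```

Under these assumptions, raising `n_experts` at a fixed `top_k` leaves `active_params` and the estimated training FLOPs unchanged while `total_params`, and with it the weight memory needed at serving time, grows. That is the tension between training budget and inference cost that the summary describes.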