Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis
2024-08-19
Abstract: Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of scaling fully differentiable Mixture-of-Experts (MoE) architectures to autoregressive language model pre-training. Conventional MoE models must optimize a non-differentiable, discrete objective when training the router network. A fully differentiable MoE architecture, SMEAR, was proposed recently (Muqeeth et al., 2023); it softly merges experts in the parameter space, but its effectiveness has only been demonstrated in downstream fine-tuning on classification tasks. The paper therefore proposes Lory, an approach that extends this kind of architecture to autoregressive language model pre-training.

Lory introduces two key techniques:

1. **Causal Segment Routing Strategy**: This strategy makes expert merging operations efficient while preserving the autoregressive nature of the language model.
2. **Similarity-Based Data Batching Method**: This method encourages expert specialization by grouping similar documents in the same training instances.

The Lory models are pre-trained from scratch on 150B tokens, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5% to +11.1%). Despite using segment-level routing, Lory models are competitive with state-of-the-art MoE models that use token-level routing. Further analysis shows that the trained experts capture domain-level specialization without supervision. These results highlight the potential of fully differentiable MoE architectures for language model pre-training and motivate future research in this area.
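
To make the idea of merging experts in parameter space with segment-level causal routing more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: each expert is reduced to a single weight matrix rather than a full feed-forward block, the router is one linear map, and all names (`merge_experts`, `causal_segment_moe`, `segment_len`) are hypothetical. The uniform weights used for the first segment are also an assumption of this sketch, since the first segment has no preceding segment to route on.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # list of rows


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mean_vector(tokens: List[Vector]) -> Vector:
    """Average the token vectors of one segment."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def merge_experts(experts: List[Matrix], weights: Vector) -> Matrix:
    """Soft merge: weighted average of expert matrices in parameter space."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * e[r][c] for w, e in zip(weights, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def causal_segment_moe(
    tokens: List[Vector],
    experts: List[Matrix],
    router: Matrix,  # shape: num_experts x hidden_dim
    segment_len: int,
) -> Tuple[List[Vector], List[Vector]]:
    """Apply one merged expert per segment.

    The merging weights for a segment are computed only from the
    previous segment, so no token influences the routing decision
    used for itself or for earlier tokens.
    """
    outputs: List[Vector] = []
    routing: List[Vector] = []
    prev_segment = None
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        if prev_segment is None:
            # Assumption of this sketch: uniform weights for the first segment.
            weights = [1.0 / len(experts)] * len(experts)
        else:
            weights = softmax(matvec(router, mean_vector(prev_segment)))
        merged = merge_experts(experts, weights)  # one merge per segment
        outputs.extend(matvec(merged, x) for x in segment)
        routing.append(weights)
        prev_segment = segment
    return outputs, routing
```

Because the merge happens once per segment rather than once per token, the cost of building the merged expert is amortized over `segment_len` tokens, which is the efficiency argument made for segment-level routing. Every step is a smooth function of the router parameters, so gradients can flow to the router without a discrete expert selection.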