Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis

2024-08-19

Abstract: Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of scaling a fully differentiable Mixture-of-Experts (MoE) architecture to autoregressive language model pre-training. Traditional MoE models must optimize a non-differentiable, discrete objective when training the router network. A fully differentiable MoE architecture, SMEAR, was proposed recently (Muqeeth et al., 2023), but its effectiveness has only been demonstrated in downstream fine-tuning on classification tasks. The paper therefore proposes Lory, an approach that extends this fully differentiable architecture to autoregressive language model pre-training. Lory introduces two key techniques:

1. **Causal Segment Routing Strategy**: This strategy performs expert merging operations efficiently while preserving the autoregressive nature of the language model.
2. **Similarity-Based Data Batching Method**: This method encourages expert specialization by grouping similar documents in training instances.

Experimental results show that Lory models significantly outperform parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5% to +11.1%). Despite using segment-level routing, Lory models are competitive with state-of-the-art MoE models that use token-level routing. Further analysis shows that the trained experts capture domain-level specialization without supervision. These results highlight the potential of fully differentiable MoE architectures for language model pre-training and advocate future research in this area.
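
To make the idea of merging experts in parameter space under a causal, segment-level router more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `causal_segment_moe`, the mean-pooling of the previous segment to compute routing weights, the uniform weights used for the first segment, and the two-layer ReLU feed-forward experts are all choices made for this example and are not specified in the abstract.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def causal_segment_moe(hidden, router_w, experts_w1, experts_w2, segment_len):
    """Soft-merged MoE layer with segment-level, causal routing (illustrative).

    hidden:     (T, d)    token representations for one sequence
    router_w:   (d, E)    hypothetical linear router
    experts_w1: (E, d, h) first-layer weights of E feed-forward experts
    experts_w2: (E, h, d) second-layer weights of E feed-forward experts
    """
    T, _ = hidden.shape
    E = router_w.shape[1]
    out = np.zeros_like(hidden)
    all_weights = []
    for start in range(0, T, segment_len):
        seg = hidden[start:start + segment_len]
        if start == 0:
            # Assumption of this sketch: no previous segment exists,
            # so fall back to uniform routing weights.
            weights = np.full(E, 1.0 / E)
        else:
            # Route using only the PREVIOUS segment, so no token's
            # output depends on tokens that come after it.
            prev = hidden[start - segment_len:start]
            weights = softmax(prev.mean(axis=0) @ router_w)
        # Merge experts in parameter space: a weighted average of the
        # expert weight matrices gives one merged feed-forward network.
        w1 = np.tensordot(weights, experts_w1, axes=1)  # (d, h)
        w2 = np.tensordot(weights, experts_w2, axes=1)  # (h, d)
        out[start:start + segment_len] = np.maximum(seg @ w1, 0.0) @ w2
        all_weights.append(weights)
    return out, np.stack(all_weights)
```

Because the routing weights are a softmax rather than a discrete top-k choice, every step in this sketch is differentiable with respect to the router parameters. Merging once per segment, rather than once per token, keeps the number of merge operations proportional to the number of segments.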

