Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

2024-01-09

Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
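
The routing step described in the abstract (a router scores 8 experts, keeps two, and combines their outputs) can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the toy experts, the gate matrix, and the names `top2_moe_layer` and `make_toy_expert` are hypothetical, and normalizing the two selected router scores with a softmax is an assumption, since the abstract only says that the outputs are combined.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe_layer(x, gate_weights, experts):
    """Route one token vector x through the two highest-scoring experts.

    gate_weights holds one weight vector per expert (hypothetical router);
    experts is a list of callables mapping a vector to a vector.
    """
    # Router logits: one dot product per expert.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    # Indices of the two experts with the highest logits.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Assumption: mixing weights come from a softmax over the two kept logits.
    mix = softmax([logits[i] for i in top2])
    # Weighted sum of the two selected experts' outputs; the other six never run.
    out = [0.0] * len(x)
    for weight, idx in zip(mix, top2):
        expert_out = experts[idx](x)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, top2, mix

def make_toy_expert(scale):
    """Stand-in for a feedforward block: just scales its input."""
    return lambda x: [scale * v for v in x]

experts = [make_toy_expert(k + 1) for k in range(8)]
# Toy router: expert k's logit is k times the first input coordinate.
gate_weights = [[float(k), 0.0, 0.0, 0.0] for k in range(8)]

token = [1.0, 0.5, -0.5, 2.0]
output, chosen, mix = top2_moe_layer(token, gate_weights, experts)
print("experts used:", chosen, "mixing weights:", [round(m, 3) for m in mix])
```

Only the two selected experts are evaluated for the token, which is why the per-token compute tracks the active parameter count rather than the total.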

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model, and makes the following claims:

1. **Performance**: Mixtral 8x7B matches or outperforms Llama 2 70B and GPT-3.5 across the evaluated benchmarks, with the largest gains over Llama 2 70B on mathematics, code generation, and multilingual tasks.
2. **Parameter utilization**: Each token is processed by only 2 of the 8 experts in a layer, but the router can select a different pair at every layer and for every token. The model therefore holds 47B parameters in total, of which only 13B are active for any given token.
3. **Multilingual support**: With a larger proportion of multilingual data in training, Mixtral 8x7B significantly outperforms Llama 2 70B on French, German, Spanish, and Italian benchmarks.
4. **Long-sequence processing**: Mixtral 8x7B was trained with a context size of 32k tokens, and its performance remains stable on long-sequence tasks.
5. **Instruction-tuned model**: An instruction-tuned version, Mixtral 8x7B - Instruct, surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B chat model on human evaluation benchmarks.

Overall, the paper aims to show that an SMoE architecture can deliver strong performance across a range of tasks while activating only a fraction of its parameters for each token.
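
The 47B and 13B figures can be related with a back-of-the-envelope calculation. The sketch below assumes a simple two-bucket model that is not stated in the summary: parameters are either shared by every token (attention, embeddings, and so on) or belong to exactly one of the 8 experts, with all experts the same size. The function name `split_parameters` is hypothetical, and the results are rough estimates derived from the rounded figures above, not published numbers.

```python
# Rounded figures quoted in the summary, in billions of parameters.
TOTAL_B = 47.0
ACTIVE_B = 13.0
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

def split_parameters(total, active, n_experts, k_active):
    """Solve the two-equation system of the assumed two-bucket model:

        total  = shared + n_experts * per_expert
        active = shared + k_active  * per_expert
    """
    per_expert = (total - active) / (n_experts - k_active)
    shared = active - k_active * per_expert
    return shared, per_expert

shared_b, per_expert_b = split_parameters(
    TOTAL_B, ACTIVE_B, NUM_EXPERTS, EXPERTS_PER_TOKEN
)

print(f"shared parameters (estimate):     {shared_b:.2f}B")
print(f"parameters per expert (estimate): {per_expert_b:.2f}B")
print(f"fraction active per token:        {ACTIVE_B / TOTAL_B:.1%}")
```

Under this assumption, most of the model's parameters sit in the expert feedforward blocks, which is also why the total is well below 8 times 7B: only the feedforward blocks are replicated, not the whole network.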

Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

OLMoE: Open Mixture-of-Experts Language Models

Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

Pixtral 12B

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

RegMix: Data Mixture as Regression for Language Model Pre-training

MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Monet: Mixture of Monosemantic Experts for Transformers

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

From Sparse to Soft Mixtures of Experts

AutoMix: Automatically Mixing Language Models

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

MoIN: Mixture of Introvert Experts to Upcycle an LLM

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs