DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W.L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X.Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, et al. (57 additional authors not shown)

2024-06-19

Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
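The sparse computation mentioned in the abstract (21B of 236B parameters, roughly 9%, activated per token) can be illustrated with a generic top-k expert routing sketch. This is not DeepSeekMoE's actual routing scheme: the function `top_k_route`, the gate matrix `W_gate`, the expert count, and all dimensions are hypothetical choices made only for illustration.

```python
import numpy as np

def top_k_route(x, W_gate, experts, k):
    """Send each token to its k highest-scoring experts and mix their outputs.

    Generic top-k gating for illustration only (assumed, not taken from the paper).
    x:       (tokens, d) token hidden states
    W_gate:  (d, n_experts) router weights
    experts: (n_experts, d, d) one toy linear layer per expert
    """
    logits = x @ W_gate                                  # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]            # k best expert ids per token
    top_logits = np.take_along_axis(logits, top, axis=-1)
    # Softmax over the selected experts only, so the k weights sum to 1.
    w = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = top[t, j]
            out[t] += w[t, j] * (x[t] @ experts[e])      # only k experts run per token
    return out, top

# Toy sizes (assumptions): 8 experts, 2 active per token.
rng = np.random.default_rng(0)
tokens, d, n_experts, k = 5, 16, 8, 2
x = rng.standard_normal((tokens, d))
W_gate = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d)) / np.sqrt(d)

out, chosen = top_k_route(x, W_gate, experts, k)
active_fraction = k / n_experts   # share of expert parameters touched per token
print("experts chosen per token:\n", chosen)
print("active expert fraction:", active_fraction)
```

In this toy setup each token touches only `k / n_experts` of the expert weights, which is the mechanism that lets total parameter count grow much faster than per-token compute.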

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper primarily addresses the training cost and inference efficiency of Large Language Models (LLMs). As parameter counts grow, model capability improves, but computational resource demands rise as well and inference throughput can drop. To tackle these issues, the paper introduces DeepSeek-V2, a powerful open-source language model based on the Mixture-of-Experts (MoE) architecture. Features of DeepSeek-V2 include:

1. **Cost-effective training**: Through an innovative Transformer architecture, DeepSeek-V2 reduces training costs while maintaining high performance.
2. **Efficient inference**: The model design significantly reduces the size of the Key-Value (KV) cache needed during inference, thereby improving inference efficiency.

To achieve these goals, the paper introduces two key technologies:

- **Multi-head Latent Attention (MLA)**: A new attention mechanism that significantly reduces the size of the KV cache through low-rank joint compression of keys and values, supporting efficient inference.
- **DeepSeekMoE architecture**: An efficient Mixture-of-Experts architecture that allows powerful models to be trained at a lower cost.

Additionally, the paper describes the pre-training setup, including data construction, hyperparameter selection, and infrastructure, and reports detailed evaluation experiments. The results show that, even with only 21 billion parameters activated per token, DeepSeek-V2 achieves top-tier performance across various benchmarks, making it one of the strongest open-source MoE language models.
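The low-rank joint compression of keys and values can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `compress_kv` and `expand_kv`, the projection matrices, and every dimension are hypothetical, and the baseline here is plain multi-head attention that caches full keys and values, so the toy ratio is not meant to reproduce the reported 93.3% figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not DeepSeek-V2's configuration).
d_model, n_heads, d_head = 64, 4, 16
d_latent = 16          # width of the shared KV latent
seq_len = 32

# One down-projection shared by keys and values, plus separate up-projections.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

def compress_kv(h):
    """Project hidden states to the latent that would be cached per token."""
    return h @ W_down                                   # (seq_len, d_latent)

def expand_kv(latent):
    """Rebuild per-head keys and values from the cached latent."""
    n = latent.shape[0]
    k = (latent @ W_up_k).reshape(n, n_heads, d_head)
    v = (latent @ W_up_v).reshape(n, n_heads, d_head)
    return k, v

h = rng.standard_normal((seq_len, d_model))             # hidden states of past tokens
kv_latent = compress_kv(h)                              # the only tensor kept in the cache
k, v = expand_kv(kv_latent)

full_cache_per_token = 2 * n_heads * d_head             # floats for full K and V
latent_cache_per_token = d_latent
reduction = 1 - latent_cache_per_token / full_cache_per_token
print("cached floats per token:", latent_cache_per_token, "vs", full_cache_per_token)
print("toy cache reduction:", reduction)
```

Because keys and values are both reconstructed from the same latent, the cache grows with `d_latent` per token rather than with `2 * n_heads * d_head`, at the cost of restricting keys and values to a low-rank subspace.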

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

OLMoE: Open Mixture-of-Experts Language Models

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

DeepSeek-VL: Towards Real-World Vision-Language Understanding

AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

HMoE: Heterogeneous Mixture of Experts for Language Modeling

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

GRIN: GRadient-INformed MoE

Toward Inference-optimal Mixture-of-Expert Large Language Models

A Closer Look into Mixture-of-Experts in Large Language Models

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models