DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W.L. Xiao, Wangding Zeng, et al. (100 additional authors not shown)

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
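To put the headline figures in the abstract into proportion, the short Python sketch below computes the share of parameters activated per token and converts the quoted GPU hours into a rental cost. The parameter counts and GPU hours come from the abstract; the function name `summarize` and the price passed to it are illustrative assumptions, not figures stated here.

```python
# Figures quoted in the abstract.
TOTAL_PARAMS_B = 671    # total parameters, in billions
ACTIVE_PARAMS_B = 37    # parameters activated per token, in billions
GPU_HOURS = 2.788e6     # H800 GPU hours for the full training run


def summarize(price_per_gpu_hour):
    """Return (activated fraction, rental cost) for an assumed hourly GPU price.

    price_per_gpu_hour is a hypothetical input; the abstract gives no price.
    """
    active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
    cost = GPU_HOURS * price_per_gpu_hour
    return active_fraction, cost


if __name__ == "__main__":
    fraction, cost = summarize(price_per_gpu_hour=2.0)  # assumed price
    print(f"activated per token: {fraction:.1%}")
    print(f"cost at the assumed price: ${cost:,.0f}")
```

Roughly one parameter in eighteen is active for any given token, which is why per-token compute is far lower than the total parameter count suggests.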

What problem does this paper attempt to address?

The main problems that this paper attempts to solve include:

1. **Improving the performance and efficiency of large-scale language models**
   - DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, it adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures.
   - The paper introduces an auxiliary-loss-free load-balancing strategy, which reduces the harm to model performance that efforts to balance expert load would otherwise cause.
2. **A multi-token prediction training objective**
   - DeepSeek-V3 sets a Multi-Token Prediction (MTP) training objective, which aims to improve data efficiency by increasing the density of training signals and lets the model pre-plan its representations so that it predicts future tokens better.
3. **Efficient and economical training methods**
   - To train efficiently, the paper supports FP8 mixed-precision training and comprehensively optimizes the training framework. Low-precision training is regarded as a promising route to efficient training.
   - FP8 computation and storage accelerate training and reduce GPU memory usage. In addition, the DualPipe algorithm is designed for efficient pipeline parallelism, and efficient cross-node all-to-all communication kernels are developed to fully utilize InfiniBand (IB) and NVLink bandwidth.
4. **Optimization in the pre-training and post-training stages**
   - In the pre-training stage, DeepSeek-V3 is trained on 14.8 trillion high-quality and diverse tokens. The entire pre-training process is very stable, with no irrecoverable loss spikes and no rollbacks.
   - After pre-training, the context length is extended in two stages, and the model's potential is further unlocked through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which align it more closely with human preferences.
5. **Evaluation and comparison**
   - Comprehensive evaluations show that, despite its economical training cost, DeepSeek-V3 performs excellently on multiple benchmarks, especially in code and mathematics. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models (such as GPT-4o and Claude-3.5-Sonnet) on a series of standard and open-ended benchmarks.

In summary, this paper aims to improve the performance and efficiency of large-scale language models through innovative architectures, training objectives, and optimization methods while keeping the training cost relatively low.
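The auxiliary-loss-free load-balancing idea in item 1 can be illustrated with a toy simulation. The sketch below assumes one common reading of the strategy: each expert carries a bias that is added to its routing score only when choosing the top-k experts, while the gate weights still come from the unbiased scores; after every step the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the synthetic score distribution, and all hyperparameters are hypothetical and are not taken from the paper.

```python
import random


def route_top_k(scores, bias, k):
    """Pick k experts by biased score; gate weights use the unbiased scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200,
             step_size=0.01, seed=0):
    """Route synthetic tokens and record the per-expert load at every step."""
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts receive higher scores on average.
    skew = [0.3 * (num_experts - i) / num_experts for i in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [0.7 * rng.random() + skew[i] for i in range(num_experts)]
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        history.append(load)
        bias = update_bias(bias, load, step_size)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("load at first step:", history[0])
    print("load at last step: ", history[-1])
```

Because the bias affects only which experts are selected and never the gate weights, balance is steered without adding a loss term whose gradient competes with the language-modeling objective, which is the motivation the summary gives for dropping the auxiliary loss.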

DeepSeek-V3 Technical Report
