Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie She, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, et al. (7 additional authors not shown)

Abstract: In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and performs comparably to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. We also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
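The abstract names a mixed expert routing strategy without describing it. The sketch below shows one common reading, assuming that "mixed" means a shared expert that processes every token plus a few routed experts picked by router score. The function names (`mixed_route`, `moe_layer`), the toy scalar experts, and the choice of `top_k` are all hypothetical and are not taken from the Hunyuan-Large code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_route(router_logits, top_k=1):
    """Pick experts for one token.

    Assumed layout (hypothetical): the shared expert always runs with
    weight 1.0, and the top_k routed experts are weighted by their
    renormalized router probabilities.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [("shared", 1.0)] + [(i, probs[i] / norm) for i in chosen]


def moe_layer(x, shared_expert, routed_experts, router_logits, top_k=1):
    """Combine the shared expert and the selected routed experts."""
    out = 0.0
    for expert_id, weight in mixed_route(router_logits, top_k):
        expert = shared_expert if expert_id == "shared" else routed_experts[expert_id]
        out += weight * expert(x)
    return out


# Toy experts acting on a scalar "token"; real experts are feed-forward networks.
shared = lambda x: x + 1.0
routed = [lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: 10.0 * x]
result = moe_layer(2.0, shared, routed, router_logits=[0.1, 2.0, -1.0], top_k=1)
print(result)  # 9.0: shared gives 3.0, routed expert 1 gives 6.0
```

Only the selected experts run for a given token, which is why a model of this kind can hold far more parameters in total than it activates per token.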

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper introduces Hunyuan-Large, a large-scale open-source mixture of experts (MoE) model developed by the Tencent Hunyuan team. Hunyuan-Large has a total of 389 billion parameters, of which 52 billion are activated, and can handle up to 256K tokens. The main goals of the paper are:

1. **Enhancing the performance of large-scale language models**: Hunyuan-Large is evaluated on multiple benchmarks, including language understanding, generation, logical reasoning, mathematical problem-solving, programming, long-context processing, and comprehensive tasks. It outperforms LLama3.1-70B on these tasks and is comparable to the larger LLama3.1-405B model.
2. **Exploring and optimizing mixture of experts technology**: The paper details the key technical choices behind Hunyuan-Large, including high-quality synthetic data, enhanced model structures (such as key-value cache compression, cyclic routing, and expert-specific learning rate strategies), and a study of the scaling laws of MoE models.
3. **Promoting community development**: Open-sourcing the code and model checkpoints of Hunyuan-Large makes the technology easier to disseminate and apply. This supports academic research and gives enterprises and developers a usable model.
4. **Addressing the challenges of large-scale model training and inference**: Cache compression reduces memory pressure and inference cost, and expert-specific learning rate strategies improve training efficiency.

### Key Technologies

- **High-quality synthetic data**: Generating and filtering synthetic data increases the diversity and quality of the training data, which helps the model generalize to unseen data.
- **Enhanced model structures**: Key-value cache compression, cyclic routing, and expert-specific learning rate strategies improve the efficiency of training and inference.
- **Exploration of MoE scaling laws**: Studying the scaling laws of MoE models provides guidance for future model development and optimization.

### Experimental Results

- **Extensive experiments**: The paper runs experiments on a wide range of benchmarks to validate the performance of Hunyuan-Large on different tasks.
- **Best performance**: The paper reports that, among existing open-source LLMs of similar scale, Hunyuan-Large performs best, especially in common sense understanding, Q&A, mathematical reasoning, programming, and comprehensive tasks.

### Conclusion

Hunyuan-Large combines strong benchmark performance with several technical innovations, providing a reference for future MoE model development. Its open-source release is intended to further promote research and application in the field of artificial intelligence.
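To make the memory argument behind key-value cache compression concrete, here is a back-of-the-envelope sketch. It assumes two generic compression levers, fewer key/value heads than query heads and one cache shared by a group of adjacent layers; the summary above does not say which techniques Hunyuan-Large actually uses. Every number in the example (64 layers, 64 or 8 KV heads, head size 128, 2 bytes per value, 256K taken as 256 * 1024 tokens) is hypothetical and is not the model's real configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, layers_per_shared_cache=1):
    """Estimate KV cache size in bytes for one sequence.

    layers_per_shared_cache > 1 models (as an assumption) a scheme in
    which a group of adjacent layers reuses a single set of K/V tensors.
    """
    # Number of layers that store their own K/V tensors (ceiling division).
    cached_layers = -(-num_layers // layers_per_shared_cache)
    # Factor 2 accounts for storing both keys and values.
    return 2 * cached_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024  # "256K tokens", assuming K = 1024

# Hypothetical baseline: every layer caches K/V for all 64 heads.
baseline = kv_cache_bytes(64, 64, 128, SEQ_LEN)

# Hypothetical compressed setup: 8 KV heads, one cache per pair of layers.
compressed = kv_cache_bytes(64, 8, 128, SEQ_LEN, layers_per_shared_cache=2)

reduction = baseline // compressed
print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB")
print(f"reduction:  {reduction}x")
```

Because the cache grows linearly with sequence length, savings of this kind matter most at long context lengths, which is where inference memory pressure is highest.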

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
