PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao
2023-03-20
Abstract: The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the resulting language model with 1.085T parameters, named PanGu-Σ. With parameters inherited from PanGu-α, we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation and code generation.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the following key issues:

1. **Model Scalability**:
   - Design a scalable model architecture that can expand to a trillion-parameter scale without significantly increasing computational costs.
   - Address the issues of imbalanced workload and all-to-all communication latency in Mixture-of-Experts (MoE) models.
   - Design a high-performance and training-efficient sparse model with a trillion parameters.
2. **System Scalability**:
   - Build an efficient distributed training system capable of training large-scale language models with limited computational resources.
   - Reduce communication overhead between the host and accelerator devices through heterogeneous computing techniques, thereby improving training throughput.
   - Utilize existing hardware resources (such as the Ascend 910 accelerator) to train trillion-parameter models on moderately sized clusters.

Through these efforts, the researchers developed a trillion-parameter language model named PanGu-Σ and demonstrated its zero-shot learning capabilities in various Chinese natural language processing downstream tasks. Additionally, the model showed excellent performance in applications such as dialogue, machine translation, and code generation.
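
The workload-imbalance issue named under Model Scalability can be made concrete with a small sketch. The summary above only names Random Routed Experts (RRE) and does not describe its mechanics, so the following is an illustrative assumption rather than the paper's implementation: it supposes that each token id is assigned to an expert by a fixed, non-learned random table, so no gating network has to be trained or balanced. The names `build_random_route_table` and `route`, and the sizes used, are hypothetical.

```python
import random
from collections import Counter


def build_random_route_table(vocab_size, num_experts, seed=0):
    """Assign every token id to one expert using a fixed random mapping.

    Assumption for illustration: the mapping is not learned, so there is
    no gating network whose decisions could collapse onto a few experts.
    """
    rng = random.Random(seed)
    # Give each expert about vocab_size / num_experts token ids, then shuffle
    # so the assignment is random but the per-expert counts stay even.
    table = [i % num_experts for i in range(vocab_size)]
    rng.shuffle(table)
    return table


def route(token_ids, table):
    """Look up the expert index for each token id in a sequence."""
    return [table[t] for t in token_ids]


# Hypothetical sizes, chosen only to keep the example small.
table = build_random_route_table(vocab_size=1000, num_experts=8)
load = Counter(table)  # number of token ids owned by each expert
assignments = route([3, 17, 999], table)
```

Note that this balances the number of token ids per expert, not the number of tokens seen during training; token frequencies in real text are skewed, so actual per-expert load would still vary under this simplified scheme.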