PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing

Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao

2023-03-20

Abstract: The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trains a trillion-parameter language model on a cluster of Ascend 910 AI processors with the MindSpore framework, and present the resulting language model, PanGu-Σ, which has 1.085T parameters. With parameters inherited from PanGu-α, we extend the dense Transformer model to a sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation (ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-Σ provides state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned on application data for open-domain dialogue, question answering, machine translation, and code generation.

Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the following key issues:

1. **Model Scalability**:
   - Design a scalable model architecture that can expand to a trillion-parameter scale without significantly increasing computational costs.
   - Address the issues of imbalanced workload and all-to-all communication latency in Mixture-of-Experts (MoE) models.
   - Design a high-performance and training-efficient sparse model with a trillion parameters.
2. **System Scalability**:
   - Build an efficient distributed training system capable of training large-scale language models with limited computational resources.
   - Reduce communication overhead between the host and accelerator devices through heterogeneous computing techniques, thereby improving training throughput.
   - Utilize existing hardware resources (such as the Ascend 910 accelerator) to train trillion-parameter models on moderately sized clusters.

Through these efforts, the researchers developed a trillion-parameter language model named PanGu-Σ and demonstrated its zero-shot learning capabilities in various Chinese natural language processing downstream tasks. Additionally, the model showed excellent performance in applications such as dialogue, machine translation, and code generation.
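The workload-imbalance concern in MoE models can be made concrete with a small sketch. The summary does not describe how Random Routed Experts work internally, so the code below assumes one plausible reading: each token ID is assigned to an expert by a table drawn at random once at initialization, instead of by a learned gating function. The names `build_routing_table`, `route`, and `expert_load`, and the sizes used, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random
from collections import Counter


def build_routing_table(vocab_size, num_experts, seed=0):
    """Hypothetical helper: map every token id to one expert, fixed at init.

    Assumption: routing is a static random table, not a trained gate.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]


def route(token_ids, table):
    """Look up the expert index for each token id in a batch."""
    return [table[t] for t in token_ids]


def expert_load(token_ids, table, num_experts):
    """Count how many tokens of the batch each expert receives."""
    counts = Counter(route(token_ids, table))
    return [counts.get(e, 0) for e in range(num_experts)]


# Illustrative sizes only; they are not taken from the paper.
VOCAB_SIZE = 1000
NUM_EXPERTS = 4

table = build_routing_table(VOCAB_SIZE, NUM_EXPERTS, seed=0)
batch = list(range(VOCAB_SIZE))  # one occurrence of every token id
load = expert_load(batch, table, NUM_EXPERTS)
print("tokens per expert:", load)
```

Under this assumption the assignment never changes during training, so the per-expert load depends only on the table and on the token frequencies in the data, not on the dynamics of a learned router.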

PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Optimizing Distributed Training on Frontier for Large Language Models

PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation

Rethinking Optimization and Architecture for Tiny Language Models

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning

Exploring Sparse Expert Models and Beyond

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

PaLM: Scaling Language Modeling with Pathways

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Language models scale reliably with over-training and on downstream tasks

PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Training Compute-Optimal Large Language Models

BaGuaLu: targeting brain scale pretrained models with over 37 million cores

Scaling Expert Language Models with Unsupervised Domain Discovery

CPM-2: Large-scale Cost-effective Pre-trained Language Models