GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
2023-10-25
Abstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the lack of an open-source, high-performance large language model (LLM) at the 100B-parameter scale. It introduces GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. The goals of the paper include:

1. **Improving Model Performance**: GLM-130B significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage over GPT-3 that OPT-175B and BLOOM-176B do not show. On Chinese benchmarks, GLM-130B also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model.
2. **Addressing Training Stability and Efficiency Issues**: Training a model at the 100B-parameter scale raised many unexpected technical and engineering challenges, particularly loss spikes and divergence. The paper details the training process of GLM-130B, including its design choices, its training strategies for both efficiency and stability, and the engineering efforts involved.
3. **Reducing Inference Costs**: By leveraging a unique scaling property of GLM-130B, the authors reach INT4 quantization without post-training and with almost no performance loss. This allows effective inference on 4 RTX 3090 (24G) or 8 RTX 2080 Ti (11G) GPUs, which the authors describe as the most affordable GPUs required for using 100B-scale models.
4. **Open Sourcing**: The model weights are publicly accessible, and the code, training logs, related toolkit, and lessons learned are open-sourced, so that other researchers can use and improve the model.

Overall, the paper aims to advance research on and application of large language models by developing and open-sourcing GLM-130B, with contributions in performance, training stability, and inference cost.
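The inference-cost claim in point 3 can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from the paper: it assumes every one of the 130 billion parameters is stored at the stated bit width, takes 1 GB as 10^9 bytes, and ignores activations, caches, and framework overhead. The helper names `weight_memory_gb` and `fits` are hypothetical and used only for illustration.

```python
# Back-of-the-envelope check: do the weights of a 130B-parameter model fit
# in the combined GPU memory of the setups named in the abstract?
# Assumptions (not from the paper): every parameter is stored at the stated
# bit width, 1 GB = 10**9 bytes, and activations/caches/overhead are ignored.

PARAMS = 130e9  # parameter count from the abstract

# Total memory per setup, in GB: number of GPUs times memory per GPU.
SETUPS_GB = {
    "4 x RTX 3090 (24G)": 4 * 24,
    "8 x RTX 2080 Ti (11G)": 8 * 11,
}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB."""
    return num_params * bits_per_param / 8 / 1e9


def fits(num_params: float, bits_per_param: int, total_gb: float) -> bool:
    """True if the weights alone fit in the given total memory."""
    return weight_memory_gb(num_params, bits_per_param) <= total_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        need = weight_memory_gb(PARAMS, bits)
        for name, total in SETUPS_GB.items():
            verdict = "fits" if fits(PARAMS, bits, total) else "does not fit"
            print(f"{bits:>2}-bit: {need:.0f} GB of weights {verdict} in {name} ({total} GB)")
```

Under these assumptions, 16-bit weights need 260 GB and 8-bit weights need 130 GB, both more than either setup offers, while 4-bit weights need 65 GB, which is below the 96 GB and 88 GB totals. The real footprint is larger than the weights alone, so this only shows why INT4 is the precision at which these GPU configurations become plausible.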