GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

2023-10-25

Abstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the lack of an open-source, high-performance large language model (LLM) at the 100B-parameter scale. It introduces GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. Its goals are:

1. **Improving Model Performance**: GLM-130B significantly outperforms GPT-3 175B (davinci) on a wide range of popular English benchmarks, an advantage over GPT-3 that OPT-175B and BLOOM-176B do not show. On Chinese benchmarks, GLM-130B also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model.
2. **Addressing Training Stability and Efficiency Issues**: Training a 100B-scale model raised many unexpected technical and engineering challenges, particularly loss spikes and divergence. The paper describes the training process of GLM-130B, including its design choices, its training strategies for both efficiency and stability, and the engineering effort involved.
3. **Reducing Inference Costs**: By leveraging a unique scaling property of GLM-130B, the authors reach INT4 quantization without post-training and with almost no performance loss. This allows effective inference on 4 RTX 3090 (24G) or 8 RTX 2080 Ti (11G) GPUs, which the paper describes as the most affordable GPUs required for using 100B-scale models.
4. **Open Sourcing**: The model weights are publicly accessible, and the code, training logs, related toolkit, and lessons learned are open-sourced so that other researchers can use and improve the model.

Overall, the paper aims to advance research on and application of large language models by developing and open-sourcing GLM-130B, with contributions in performance, training stability, and inference cost.
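The inference-cost claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from the paper: it counts weight storage only, assumes 1 GB = 1e9 bytes, treats the "24G" and "11G" labels as GB per GPU, and ignores activations, attention caches, quantization scales, and any layers kept in higher precision. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (assuming 1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 130e9  # parameter count stated in the abstract

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit weights

# GPU configurations named in the abstract: (GPU count, memory per GPU in GB).
configs = {"4x RTX 3090": (4, 24), "8x RTX 2080 Ti": (8, 11)}

print(f"FP16 weights: {fp16_gb:.0f} GB, INT4 weights: {int4_gb:.0f} GB")
for name, (count, per_gpu_gb) in configs.items():
    total_gb = count * per_gpu_gb
    print(
        f"{name}: {total_gb} GB total, "
        f"INT4 weights fit: {int4_gb <= total_gb}, "
        f"FP16 weights fit: {fp16_gb <= total_gb}"
    )
```

Under these assumptions, the 4-bit weights take roughly a quarter of the 16-bit footprint, which is why both consumer-GPU configurations have enough aggregate memory for the INT4 model but not for the 16-bit one. Real deployments need additional headroom beyond the weights.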

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

CPM: A large-scale generative Chinese Pre-trained language model

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

GEB-1.3B: Open Lightweight Large Language Model

BMInf: An Efficient Toolkit for Big Model Inference and Tuning

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

FLM-101B: An Open LLM and How to Train It with $100K Budget

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

OPT: Open Pre-trained Transformer Language Models

OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

mGPT: Few-Shot Learners Go Multilingual

CPM-2: Large-scale Cost-effective Pre-trained Language Models