ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
2024-07-30
Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small set of corpora from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 on general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4 Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 on long-context tasks, and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use (including a web browser, Python interpreter, text-to-image model, and user-defined functions) to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using a Python interpreter. Along the way, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone.
The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.
Computation and Language
What problem does this paper attempt to address?
The paper introduces the latest developments in the ChatGLM series of models, particularly GLM-4 and its variants. Its core objective is to develop a series of advanced large language models (LLMs) that align with human preferences and perform well across a variety of tasks. Specifically, the paper addresses the following key issues:

1. **Model Development and Optimization**: It describes the development journey from GLM-130B to GLM-4, including improvements in model architecture, the selection and processing of pre-training data, and the use of Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for model alignment.
2. **Performance Evaluation**: GLM-4 is evaluated on multiple fronts, including academic benchmarks (such as MMLU and GSM8K), instruction following (using the IFEval dataset), long-text processing (using LongBench-Chat), and the quality of Chinese alignment (using AlignBench). The results show that GLM-4 approaches or exceeds the performance of current state-of-the-art models, such as GPT-4, in several respects.
3. **Tool Integration**: It introduces the GLM-4 All Tools version, which is further aligned to understand user intent and autonomously select appropriate tools to complete complex tasks, such as using a web browser to obtain online information or a Python interpreter to solve mathematical problems.

In summary, the core contribution of this paper is a family of language models, together with their training and alignment pipeline, that handles a broad range of natural language processing tasks and performs particularly well in Chinese.
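
The tool-selection behavior described under Tool Integration can be pictured with a small dispatch loop. The sketch below is illustrative only and is not the GLM-4 All Tools implementation: the names `TOOLS`, `choose_tool`, and `respond` are hypothetical, and the model's decision is replaced by a simple prefix check so the control flow is easy to follow.

```python
from typing import Callable, Dict, Optional, Tuple


def python_tool(expr: str) -> str:
    # Evaluate a simple arithmetic expression with builtins disabled.
    # Illustration only: eval is not safe for untrusted input.
    return str(eval(expr, {"__builtins__": {}}, {}))


def browser_tool(query: str) -> str:
    # Stand-in for a real web search; returns a placeholder string.
    return f"[search results for: {query}]"


# Hypothetical tool registry mapping a tool name to a callable.
TOOLS: Dict[str, Callable[[str], str]] = {
    "python": python_tool,
    "browser": browser_tool,
}


def choose_tool(request: str) -> Optional[Tuple[str, str]]:
    """Stub for the model's decision: (tool name, tool input) or None."""
    if request.startswith("calc:"):
        return "python", request[len("calc:"):].strip()
    if request.startswith("search:"):
        return "browser", request[len("search:"):].strip()
    return None  # no tool needed; answer directly


def respond(request: str) -> str:
    # Route the request to the chosen tool, or answer without one.
    decision = choose_tool(request)
    if decision is None:
        return f"(direct answer to: {request})"
    name, tool_input = decision
    return TOOLS[name](tool_input)
```

In a real system, the stubbed `choose_tool` step would be the model itself emitting a tool call, and the tool output would typically be fed back to the model for a final answer rather than returned directly.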