Baichuan 2: Open Large-scale Language Models

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, JunTao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu

2023-09-20

Abstract: Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.

Computation and Language

What problem does this paper attempt to address?

The main goal of this paper is to introduce and release Baichuan 2, a series of large-scale multilingual language models in two sizes, 7 billion and 13 billion parameters. The models are trained from scratch on 2.6 trillion tokens and aim to address the following key issues:

1. **Openness and Transparency**: Many of the most powerful large language models (such as GPT-4 and PaLM-2) are closed-source, which denies researchers access to the model parameters and makes in-depth study or fine-tuning difficult. Baichuan 2 supports research in this field by open-sourcing all of its pre-training checkpoints.
2. **Multilingual Support**: Most existing open-source large language models focus mainly on English; LLaMA, for example, is pre-trained primarily on English data. Baichuan 2 is optimized for multiple languages, especially Chinese, which improves its performance on tasks in those languages.
3. **Performance Improvement**: Compared with the previous generation of Baichuan models, Baichuan 2 shows significant improvements on multiple benchmarks, particularly in mathematics, code generation, and the medical and legal domains.
4. **Safety and Ethical Considerations**: The paper also addresses the safety and ethical issues of the model and proposes a series of measures to make the model safer, so that its outputs align with social values.

By open-sourcing these models together with the intermediate checkpoints from training, the research team hopes to encourage the community to further explore the training dynamics and applications of large language models.
