Ziya2: Data-centric Learning is All LLMs Need

Ruyi Gan, Ziwei Wu, Renliang Sun, Junyu Lu, Xiaojun Wu, Dixiang Zhang, Kunhao Pan, Junqing He, Yuanhe Tian, Ping Yang, Qi Yang, Hao Wang, Jiaxing Zhang, Yan Song
2024-04-05
Abstract: Various large language models (LLMs) have been proposed in recent years, including closed- and open-source ones, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as high cost of training models from scratch, and continual pre-training leading to catastrophic forgetting, etc. Although many such issues are addressed along the line of research on LLMs, an important yet practical limitation is that many studies overly pursue enlarging model sizes without comprehensively analyzing and optimizing the use of pre-training data in their learning process, as well as appropriate organization and leveraging of such data in training LLMs under cost-effective settings. In this work, we propose Ziya2, a model with 13 billion parameters adopting LLaMA2 as the foundation model, and further pre-trained on 700 billion tokens, where we focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 on different stages. We define three data attributes and firstly establish data-centric scaling laws to illustrate how different data impacts LLMs. Experiments show that Ziya2 significantly outperforms other models in multiple benchmarks especially with promising results compared to representative open-source ones. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.
Computation and Language
What problem does this paper attempt to address?
The paper targets several key challenges in the development of large language models (LLMs):

1. **High-cost Training**: Pre-training a model from scratch requires long training runs on many GPUs, which makes LLM training extremely costly. Continual pre-training is a more cost-effective alternative, but it can cause catastrophic forgetting: the model loses what it learned earlier, mainly because the new data is distributed differently from the original pre-training data.
2. **Lack of Open-source Data**: Open-source LLMs are usually released without their training datasets. The effectiveness of data processing directly affects LLM performance, yet there is currently no standardized methodology for cleaning pre-training data.
3. **The Relationship between Data Quality and Model Performance**: Many studies focus on enlarging model parameter counts and the amount of pre-training data, while neglecting the effect of pre-training data quality on model performance. According to the authors, no prior study has investigated which properties of pre-training data affect LLMs the most. It also remains open which type of data should be prioritized under a fixed computational budget and a fixed number of parameters.

To address these challenges, the paper proposes Ziya2, a 13-billion-parameter model that uses LLaMA2 as the foundation model and is further pre-trained on 700 billion tokens. The paper focuses on pre-training techniques and uses data-centric optimization to enhance the learning process of Ziya2 at different stages. The authors define three data attributes and, for the first time, establish data-centric scaling laws to illustrate how different data affect LLMs.

The experimental results show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared with representative open-source models.