Abstract: Various large language models (LLMs) have been proposed in recent years, including closed- and open-source ones, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as the high cost of training models from scratch and catastrophic forgetting caused by continual pre-training. Although many such issues are addressed along the line of research on LLMs, an important yet practical limitation is that many studies overly pursue enlarging model sizes without comprehensively analyzing and optimizing the use of pre-training data in their learning process, or appropriately organizing and leveraging such data to train LLMs under cost-effective settings. In this work, we propose Ziya2, a model with 13 billion parameters that adopts LLaMA2 as the foundation model and is further pre-trained on 700 billion tokens, where we focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 at different stages. We define three data attributes and, for the first time, establish data-centric scaling laws to illustrate how different data impacts LLMs. Experiments show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared to representative open-source ones. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.

What problem does this paper attempt to address?

This paper addresses several key challenges in the development of large language models (LLMs):

1. **High-cost Training**: Pretraining a model from scratch requires long training runs on many GPUs, which makes training LLMs extremely costly. Continual pretraining is a relatively cost-effective alternative, but it can cause catastrophic forgetting: the model may lose what it learned earlier, mainly because the new data follows a different distribution from the original pretraining data.
2. **Lack of Open-source Data**: Open-source LLMs are usually released without their pretraining datasets. The effectiveness of data processing methods directly affects LLM performance, yet there is currently no standardized methodology for cleaning pretraining data.
3. **The Relationship between Data Quality and Model Performance**: Many studies focus heavily on increasing the parameter count and the amount of pretraining data to improve performance, while overlooking the impact of pretraining data quality. According to the paper, no prior study has investigated which properties of pretraining data have the greatest impact on LLMs. It is also worth exploring which type of data should be prioritized under a fixed computational budget and a fixed number of parameters.

To address these challenges, the paper proposes Ziya2, a 13-billion-parameter model that uses LLaMA2 as the base model and is further pretrained on 700 billion tokens. The paper focuses on pretraining techniques and uses data-centric optimization to enhance the learning process of Ziya2 at different stages. The authors define three data attributes and, for the first time, establish data-centric scaling laws to illustrate how different data affect LLMs.

The experimental results show that Ziya2 significantly outperforms other models on multiple benchmarks, especially when compared with representative open-source models.

Ziya2: Data-centric Learning is All LLMs Need

Zyda: A 1.3T Dataset for Open Language Modeling

YAYI 2: Multilingual Open-Source Large Language Models

YuLan: An Open-source Large Language Model

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

Baichuan 2: Open Large-scale Language Models

OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

Supervised Knowledge Makes Large Language Models Better In-context Learners

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

LLMBox: A Comprehensive Library for Large Language Models

InternLM2 Technical Report

Efficient Multimodal Learning from Data-centric Perspective

SilverSight: A Multi-Task Chinese Financial Large Language Model Based on Adaptive Semantic Space Learning

BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models

FLM-101B: An Open LLM and How to Train It with $100K Budget