Abstract: In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLM of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general-purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

What problem does this paper attempt to address?

The main issue this paper addresses is the growing commercialization and reduced transparency in the development of large language models (LLMs). Specifically:

1. **Openness and Transparency**: As the commercial value of LLMs has become widely recognized, many organizations release their models but withhold information critical for reproducing them. This practice slows research progress and hinders collaboration and transparency within the open-source community.
2. **Accessibility of High-Quality Bilingual Models**: Existing large language models often focus on a single language, particularly English, and fall short for applications that require high-quality Chinese processing.
3. **Data Contamination Issues**: The paper also points out that test data contamination is a pressing issue that undermines the fairness and reliability of benchmark evaluation.

To address these issues, the paper introduces **Skywork-13B**, a bilingual large language model with 13 billion parameters, and promotes openness and transparency in the following ways:

- **Public Release of the Model**: The release includes not only the base model but also an optimized dialogue version, along with intermediate checkpoints from the training process, so that other researchers can better understand and reproduce how the model's capabilities develop.
- **Open High-Quality Corpus**: The authors also release a portion of the SkyPile corpus, a high-quality Chinese pre-training corpus containing over 150 billion tokens, to promote research on Chinese language models.
- **Data Contamination Detection Method**: A new method for detecting data contamination is proposed to facilitate further research on this issue.

In summary, by releasing Skywork-13B and its related resources, the paper aims to revive the spirit of the open-source community and enhance transparency and collaboration in the field of natural language processing.

Skywork: A More Open Bilingual Foundation Model

LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

YuLan: An Open-source Large Language Model

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

SkyMath: Technical Report

Tele-FLM Technical Report

Baichuan 2: Open Large-scale Language Models

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xmodel-LM Technical Report

LLM360: Towards Fully Transparent Open-Source LLMs

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

YAYI 2: Multilingual Open-Source Large Language Models

TeleChat Technical Report

PolyLM: An Open Source Polyglot Large Language Model

Qwen Technical Report

FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

InternLM-Law: An Open Source Chinese Legal Large Language Model

ChuXin: 1.6B Technical Report