Skywork: A More Open Bilingual Foundation Model

Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
2023-10-30
Abstract: In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLM of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general-purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main issue this paper addresses is the growing commercialization and declining transparency in the development of large language models (LLMs). Specifically:

1. **Openness and Transparency**: As the commercial value of large language models has become widely recognized, many organizations release their models while withholding information critical to reproducing them. This practice slows research progress and hinders collaboration and transparency within the open-source community.
2. **Accessibility of High-Quality Bilingual Models**: Although a number of large language models are available, they often focus on a single language, particularly English. For applications that require high-quality Chinese processing, existing models are insufficient.
3. **Data Contamination Issues**: The paper also points out that test data contamination is a pressing issue that undermines the reliability and fairness of benchmark evaluation.

To address these issues, the paper introduces **Skywork-13B**, a bilingual large language model with 13 billion parameters, and promotes openness and transparency in the following ways:

- **Public Release of the Model**: The release includes not only the Skywork-13B base model but also an optimized dialogue version, along with intermediate checkpoints from the training process, so that other researchers can better understand and reproduce how the model's capabilities develop.
- **Open High-Quality Corpus**: The authors also release a portion of the SkyPile corpus, a high-quality Chinese pre-training corpus containing over 150 billion tokens, to support research on Chinese language models.
- **Data Contamination Detection Method**: The paper proposes a new leakage detection method to facilitate further research on test data contamination.

In summary, by releasing Skywork-13B and its related resources, the paper aims to revive the spirit of the open-source community and enhance transparency and collaboration in the field of natural language processing.