Dynamic data sampler for cross-language transfer learning in large language models

Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou
2024-05-17
Abstract: Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer specifically to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the challenges of training large language models (LLMs) for non-English languages, particularly Chinese, and proposes a method called ChatFlow. Specifically, it aims to solve the following key issues:

1. **Cost-effective training of cross-lingual large language models**: High-quality non-English corpora are difficult to obtain and training demands substantial computational resources. ChatFlow aims to train high-quality Chinese LLMs in a cost-effective manner.
2. **Knowledge transfer and representation alignment**: ChatFlow builds on the pre-trained knowledge of an English language model and continues training it on a mix of Chinese, English, and parallel corpora. The goal is to align representations across languages so that knowledge transfers effectively from the English model to the Chinese model.
3. **Smooth transition from pre-training to fine-tuning**: LLM training typically involves two separate stages, pre-training and supervised fine-tuning. The sudden change in data distribution between these stages can cause the model to forget previously learned knowledge. ChatFlow introduces a dynamic data sampler that transitions progressively from pre-training to fine-tuning, thereby avoiding this issue.

With these methods, ChatFlow improves the performance of Chinese language models using relatively little data and computation. On popular Chinese and English benchmarks, it outperforms other Chinese models post-trained on LLaMA-2-7B. Additionally, the researchers have provided public access to the code and weights to support reproducibility and further research.
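
The idea of a dynamic data sampler can be illustrated with a minimal sketch. The text here does not specify the schedule ChatFlow uses, so the linear ramp below is an assumption, and the names `sft_probability` and `dynamic_sampler` are hypothetical. The sketch only shows the general technique: the probability of drawing a supervised example rises gradually over training, instead of switching abruptly between two stages.

```python
import random
from typing import Iterator, Sequence, Tuple


def sft_probability(step: int, total_steps: int) -> float:
    """Probability of drawing a supervised example at a given step.

    Assumption: a linear ramp from 0.0 at the first step to 1.0 at the
    last step. The actual schedule used by ChatFlow may differ.
    """
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def dynamic_sampler(
    pretrain: Sequence[str],
    sft: Sequence[str],
    total_steps: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs, shifting from pre-training to SFT data."""
    rng = random.Random(seed)
    for step in range(total_steps):
        p = sft_probability(step, total_steps)
        # Draw from the supervised pool with probability p,
        # otherwise from the unsupervised pre-training pool.
        if rng.random() < p:
            yield "sft", rng.choice(sft)
        else:
            yield "pretrain", rng.choice(pretrain)


if __name__ == "__main__":
    # Toy pools standing in for real corpora (hypothetical data).
    pretrain_pool = ["zh text", "en text", "zh-en parallel pair"]
    sft_pool = ["instruction + response"]
    for source, example in dynamic_sampler(pretrain_pool, sft_pool, total_steps=10):
        print(source, "|", example)
```

Early steps draw almost entirely from the pre-training pool and late steps almost entirely from the supervised pool, so the data distribution changes gradually rather than all at once. In a real pipeline the two pools would be tokenized datasets and the pre-training pool would itself mix Chinese, English, and parallel data.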