Abstract: Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer specifically to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.

What problem does this paper attempt to address?

The paper addresses the challenges of training large language models (LLMs) for non-English languages, particularly Chinese, and proposes a method called ChatFlow. Specifically, it targets three issues:

1. **Cost-effective training of cross-lingual large language models**: High-quality non-English corpora are hard to obtain, and training demands substantial computational resources. ChatFlow aims to train a high-quality Chinese LLM in a cost-effective manner.
2. **Knowledge transfer and representation alignment**: ChatFlow starts from the pre-trained knowledge of an English-centric model and continues training on a mix of Chinese, English, and parallel corpora. The goal is to align representations across languages and thereby transfer knowledge from the English model to the Chinese model.
3. **Smooth transition from pre-training to fine-tuning**: LLM training typically involves two separate stages, pre-training and supervised fine-tuning. The abrupt change in data distribution between these stages can cause the model to forget previously learned knowledge. ChatFlow introduces a dynamic data sampler that shifts gradually from pre-training data to fine-tuning data to mitigate this issue.

With these methods, ChatFlow can improve the performance of Chinese language models with comparatively little data and compute, and it outperforms other Chinese models post-trained on LLaMA-2-7B on popular Chinese and English benchmarks. The researchers have also released the code and weights publicly to support reproducibility and further research.
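The idea of a dynamic data sampler can be illustrated with a minimal Python sketch. This is not ChatFlow's actual implementation: the summary does not specify the schedule, so the linear ramp below is an assumption, and the names `sft_probability` and `DynamicDataSampler` are hypothetical. The sketch only shows how the share of supervised fine-tuning samples could grow with training progress instead of switching abruptly between two stages.

```python
import random


def sft_probability(step, total_steps, start=0.0, end=1.0):
    """Probability of drawing a supervised sample at a given step.

    Assumption: a linear ramp from `start` to `end`; the real
    schedule used by ChatFlow is not given in the summary.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


class DynamicDataSampler:
    """Hypothetical sampler mixing pre-training and fine-tuning data."""

    def __init__(self, pretrain_data, sft_data, total_steps, seed=0):
        self.pretrain_data = pretrain_data
        self.sft_data = sft_data
        self.total_steps = total_steps
        self.rng = random.Random(seed)  # seeded for reproducibility

    def sample(self, step):
        # Pick the source according to the current mixing probability,
        # then draw one example uniformly from that source.
        p = sft_probability(step, self.total_steps)
        if self.rng.random() < p:
            source = self.sft_data
        else:
            source = self.pretrain_data
        return self.rng.choice(source)


if __name__ == "__main__":
    pretrain = ["raw zh text", "raw en text", "zh-en parallel pair"]
    sft = ["instruction/response A", "instruction/response B"]
    sampler = DynamicDataSampler(pretrain, sft, total_steps=1000)
    for step in (0, 250, 500, 750, 1000):
        print(step, round(sft_probability(step, 1000), 2), sampler.sample(step))
```

Early steps draw almost exclusively from the unsupervised pool and late steps almost exclusively from the supervised pool, so the data distribution changes gradually rather than all at once. A non-linear schedule (for example, a cosine or stepwise ramp) could be swapped into `sft_probability` without changing the sampler.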

Dynamic data sampler for cross-language transfer learning in large language models

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

Why Not Transform Chat Large Language Models to Non-English?

Using Large Language Model for End-to-End Chinese ASR and NER

Chat Vector: A Simple Approach to Equip LLMs With New Language Chat Capabilities

BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

Mutual Enhancement of Large and Small Language Models with Cross-Silo Knowledge Transfer

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Extrapolating Large Language Models to Non-English by Aligning Languages

Supervised Knowledge Makes Large Language Models Better In-context Learners

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

LLaMA Beyond English: An Empirical Study on Language Capability Transfer

SilverSight: A Multi-Task Chinese Financial Large Language Model Based on Adaptive Semantic Space Learning

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

Large Language Model Enhanced Machine Learning Estimators for Classification

CAT-LLM: Prompting Large Language Models with Text Style Definition for Chinese Article-style Transfer

Online Training of Large Language Models: Learn while chatting

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

LoRA-Flow: Dynamic LoRA Fusion for Large Language Models in Generative Tasks

LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data