MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
2024-06-03
Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur under the WSD LRS. With the WSD LRS, we are able to study the data-model scaling law efficiently, without extensive retraining experiments on both the model and data axes, and from it we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:

### Main Issues Addressed by the Paper

1. **Resource Efficiency and Practical Costs**: Large Language Models (LLMs) are powerful, but their training is expensive and resource-intensive, which limits the experiments most researchers and enterprises can afford to run. Deploying models of this scale on end devices such as personal computers or smartphones is also inefficient or even infeasible.
2. **Exploring the Potential of Small Language Models (SLMs)**: Given these issues with LLMs, the paper explores SLMs as a resource-efficient alternative, with a focus on whether they can reach capabilities comparable to those of larger models.
3. **Scalability Strategies**: The paper proposes a set of training methods that apply to current SLMs and can also guide the development of future LLMs, since they scale along both the model-size and data-size dimensions.

### Specific Contributions

- **MiniCPM Series Models**: Introduces a series of SLMs named MiniCPM, in particular the variants with 1.2 billion and 2.4 billion non-embedding parameters. These models excel within their respective size categories and perform on par with language models of 7 billion to 13 billion parameters.
- **Scalable Training Strategies**: For model scaling, the paper uses extensive model wind tunnel experiments to achieve stable and optimal scaling. For data scaling, it introduces the Warmup-Stable-Decay (WSD) learning rate scheduler, which supports continuous training and domain adaptation and makes it possible to study data-model scaling laws without extensive retraining experiments.
- **Data-Model Scaling Laws**: Using the WSD learning rate scheduler, the paper investigates data-model scaling laws and derives a compute-optimal data-model ratio much higher than the Chinchilla-optimal ratio, indicating that more emphasis should be placed on increasing data size when scaling compute.
- **MiniCPM Family**: Besides the base models, the paper introduces other members of the MiniCPM family, including MiniCPM-DPO, MiniCPM-128K, and MiniCPM-MoE, which perform excellently across multiple benchmarks.

In summary, the paper aims to demonstrate the significant potential of SLMs through the MiniCPM series and its scalable training strategies, and to provide guidance for building large-scale language models in a more scientific and sustainable way.
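
The Warmup-Stable-Decay scheduler mentioned above can be illustrated with a minimal sketch. The summary only names the three phases, so everything else here is an assumption made for illustration: the function name `wsd_lr`, its parameters, the linear warmup, the linear decay shape, and the `min_lr` floor are hypothetical and may differ from the schedule the authors actually use.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` under a three-phase Warmup-Stable-Decay schedule.

    Illustrative sketch only: the linear warmup, linear decay, and `min_lr`
    floor are assumptions, not details taken from the paper.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")

    # Phase 1 (warmup): ramp linearly up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps

    # Phase 2 (stable): hold the learning rate constant at max_lr.
    if step < warmup_steps + stable_steps:
        return max_lr

    # Phase 3 (decay): interpolate linearly from max_lr down to min_lr,
    # then stay at min_lr once the decay phase is over.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    # Hypothetical settings: 10 warmup, 100 stable, 20 decay steps.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, max_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

In this sketch the learning rate during the stable phase does not depend on the total training length, so lengthening `stable_steps` leaves every earlier learning rate unchanged. That property is one plausible reading of why such a schedule suits continued training: a run can be extended from a stable-phase checkpoint, with the decay phase applied only when a final model is needed.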