Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur under the WSD LRS. With the WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both the model and data axes, from which we derive a compute-optimal data-model ratio much higher than the Chinchilla-optimal one. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are publicly available at https://github.com/OpenBMB/MiniCPM.

What problem does this paper attempt to address?

The paper primarily aims to address the following issues:

### Main Issues Addressed by the Paper

1. **Resource Efficiency and Practical Costs**: Large Language Models (LLMs) are powerful, but they are expensive to train and consume enormous resources, which limits the experiments most researchers and enterprises can run. Deploying such large models on end devices such as personal computers or smartphones is also inefficient or outright infeasible.
2. **Exploring the Potential of Small Language Models (SLMs)**: Given these issues with LLMs, the paper explores SLMs as a resource-efficient alternative, focusing on whether they can match the capabilities of larger models.
3. **Scalability Strategies**: The paper proposes a set of training methods that apply to current SLMs and can also guide the development of future LLMs, since they scale along both the model-size and data-size dimensions.

### Specific Contributions

- **MiniCPM Series Models**: Introduces a series of SLMs named MiniCPM, specifically variants with 1.2B and 2.4B non-embedding parameters. These models perform excellently within their respective size categories and are on par with 7B-13B LLMs.
- **Scalable Training Strategies**: For model scaling, the paper relies on extensive model wind tunnel experiments to achieve stable and optimal scaling. For data scaling, it introduces a learning rate scheduler named Warmup-Stable-Decay (WSD), which supports continuous training and domain adaptation and makes it possible to study data-model scaling laws efficiently, without extensive retraining experiments along both the model and data axes.
- **Data-Model Scaling Laws**: Using the WSD learning rate scheduler, the paper investigates data-model scaling laws and derives a compute-optimal data-to-model ratio much higher than the Chinchilla-optimal ratio, indicating that more emphasis should be placed on increasing data size when scaling compute.
- **MiniCPM Family**: Besides the base models, the paper also introduces other members of the MiniCPM family, including MiniCPM-DPO, MiniCPM-128K, and MiniCPM-MoE, which the authors report as performing excellently.

In summary, the paper aims to demonstrate the significant potential of SLMs through the MiniCPM series models and their scalable training strategies, providing guidance for building large-scale language models in a more scientific and sustainable way.
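
To make the Warmup-Stable-Decay idea described above more concrete, here is a minimal Python sketch of such a schedule. It is an illustration under stated assumptions, not the paper's implementation: the function name `wsd_lr`, its parameters, the linear warmup, and the exponential decay toward `final_ratio` times the peak rate are all hypothetical choices, since the abstract only names the three phases.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, final_ratio=0.1):
    """Learning rate at `step` for a hypothetical Warmup-Stable-Decay schedule.

    Assumed shape (not taken from the paper):
      - warmup: linear ramp up to peak_lr over warmup_steps
      - stable: constant peak_lr for stable_steps
      - decay:  exponential decay from peak_lr to peak_lr * final_ratio
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps < 0 or stable_steps < 0 or decay_steps <= 0:
        raise ValueError("phase lengths must be non-negative, decay_steps positive")

    # Phase 1: linear warmup, reaching peak_lr on the last warmup step.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: hold the learning rate constant at its peak.
    decay_start = warmup_steps + stable_steps
    if step < decay_start:
        return peak_lr

    # Phase 3: decay; progress runs from 0 to 1 and is clamped afterwards.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr * final_ratio ** progress


if __name__ == "__main__":
    # Print the rate at a few representative steps of a short schedule.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, peak_lr=1e-2, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the rate is constant during the stable phase, a checkpoint saved there can in principle be resumed with more data and decayed later. This is one way to read the abstract's claim that the scheduler is conducive to continuous training and to studying the data axis without retraining from scratch.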

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
