Abstract: Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.

What problem does this paper attempt to address?

### Problems the Paper Aims to Address

The paper focuses on the research and development of small language models (SLMs) and addresses the following key issues:

1. **Lack of Academic Attention**: Although SLMs are widely deployed in modern smart devices, they receive less academic attention than large language models (LLMs). The paper aims to draw more attention to SLM research through a comprehensive survey, measurements, and analysis.
2. **Technical Capability Assessment**: The paper evaluates 59 state-of-the-art open-source SLMs in detail, covering their architectural innovations, training datasets, and training algorithms. These evaluations are meant to reveal the practical capabilities of SLMs in domains such as commonsense reasoning, in-context learning, mathematics, and coding.
3. **Runtime Cost Analysis**: To understand the on-device runtime costs of SLMs, the paper benchmarks their inference latency and memory footprint, and draws insights from an in-depth analysis of the benchmarking data.
4. **Future Research Directions**: Besides summarizing the key innovations of existing SLMs, the paper proposes several potential research topics to guide future work.

### Main Contributions

- **Comprehensive Survey and Evaluation**: The paper systematically reviews SLMs released in recent years, summarizes their key innovations, and benchmarks their capabilities and on-device runtime costs.
- **Valuable Insights**: Through in-depth investigation of open-source SLMs, the paper derives insights that may benefit future SLM research, and it summarizes several potential research topics.
- **Public Results and Tools**: The paper makes all results and benchmarking tools publicly available to promote and accelerate SLM research.

### Key Findings

- **Typical Small Language Model Architectures**: As of August 2024, a typical SLM tends to use Group-Query Attention, a gated FFN with SiLU activation, an FFN intermediate ratio between 2 and 8, RMS normalization, and a vocabulary larger than 50K. These settings are chosen mostly empirically, however, and there is no rigorous public validation of their superiority.
- **Impact of Architectural Innovations**: Architectural innovations have a relatively significant impact on on-device runtime performance, but their effect on model capability is limited. Apart from sharing weights between the embedding layer and the final language model head, other architectural innovations have not been widely adopted or studied.

### Training Datasets

The paper also investigates the open-source pre-training datasets used to train SLMs and identifies 12 commonly used ones: The Pile, FineWeb-Edu, StarCoder, Cosmopedia, RefinedWeb, RedPajama, Dolma, WuDaoCorpora, RoBERTa CCNewsV2, PushShift Reddit, DCLM-baseline, and CulturaX. The Pile is the most commonly used of these, especially in 2022 and 2023. More datasets have been proposed recently, however, so the choices have become more diverse.
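The architectural settings listed under Key Findings can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper: `DecoderConfig`, its field names, and the example values are hypothetical, and the formulas assume bias-free linear layers, rotary position embeddings (no learned position parameters), and two RMS normalization layers per block. Under those assumptions it estimates how sharing the embedding and head weights changes the parameter count, and how Group-Query Attention changes the size of the key-value cache.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical decoder-only transformer config (illustrative values only)."""
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int     # smaller than num_heads under Group-Query Attention
    ffn_ratio: float      # FFN intermediate size divided by hidden size
    tie_embeddings: bool  # share weights between embedding and LM head


def attention_params(cfg: DecoderConfig) -> int:
    """Q and output projections are full size; K and V shrink with num_kv_heads."""
    head_dim = cfg.hidden_size // cfg.num_heads
    q_and_out = 2 * cfg.hidden_size * cfg.hidden_size
    k_and_v = 2 * cfg.hidden_size * cfg.num_kv_heads * head_dim
    return q_and_out + k_and_v


def gated_ffn_params(cfg: DecoderConfig) -> int:
    """A gated FFN has three matrices: gate, up, and down projections."""
    intermediate = int(cfg.hidden_size * cfg.ffn_ratio)
    return 3 * cfg.hidden_size * intermediate


def total_params(cfg: DecoderConfig) -> int:
    """Sum over all blocks, plus embedding, optional untied head, and final norm."""
    norms_per_layer = 2 * cfg.hidden_size  # assumed: two RMSNorm weight vectors
    per_layer = attention_params(cfg) + gated_ffn_params(cfg) + norms_per_layer
    embedding = cfg.vocab_size * cfg.hidden_size
    lm_head = 0 if cfg.tie_embeddings else embedding
    return cfg.num_layers * per_layer + embedding + lm_head + cfg.hidden_size


def kv_cache_bytes(cfg: DecoderConfig, seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys and values cached for every layer, KV head, and token position."""
    head_dim = cfg.hidden_size // cfg.num_heads
    return 2 * cfg.num_layers * cfg.num_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative configuration, not a model from the survey.
example = DecoderConfig(
    vocab_size=50_000, hidden_size=1024, num_layers=12,
    num_heads=16, num_kv_heads=4, ffn_ratio=4.0, tie_embeddings=True,
)

if __name__ == "__main__":
    untied = replace(example, tie_embeddings=False)
    full_mha = replace(example, num_kv_heads=example.num_heads)
    print("params, tied head:  ", total_params(example))
    print("params, untied head:", total_params(untied))
    print("KV cache bytes, GQA:", kv_cache_bytes(example, seq_len=2048))
    print("KV cache bytes, MHA:", kv_cache_bytes(full_mha, seq_len=2048))
```

At this scale the embedding table is a sizeable share of the total, which is one plausible reason weight sharing matters more for small models than for large ones. The key-value cache scales linearly with `num_kv_heads`, so reducing it lowers memory use at inference time without changing the query or output projections.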

Small Language Models: Survey, Measurements, and Insights

A Survey of Small Language Models

A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness

What is the Role of Small Models in the LLM Era: A Survey

Small Language Models for Application Interactions: A Case Study

Computational Bottlenecks of Training Small-scale Large Language Models

A Survey of Large Language Models

Super Tiny Language Models

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?

PhoneLM: an Efficient and Capable Small Language Model Family through Principled Pre-training

A Survey on Efficient Inference for Large Language Models

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

History, Development, and Principles of Large Language Models-An Introductory Survey

Model Compression and Efficient Inference for Large Language Models: A Survey

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Scientific Large Language Models: A Survey on Biological & Chemical Domains

On-Device Language Models: A Comprehensive Review

SLM-Mod: Small Language Models Surpass LLMs at Content Moderation

Efficient Large Language Models: A Survey

Large language models as linguistic simulators and cognitive models in human research