Small Language Models: Survey, Measurements, and Insights

Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu
2024-09-24
Abstract: Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Address

The paper focuses on the research and development of small language models (SLMs) and addresses the following key issues:

1. **Lack of Academic Attention**: Although SLMs are widely deployed in modern smart devices, they receive far less academic attention than large language models (LLMs). The paper aims to draw more attention to SLM research through a comprehensive survey, measurements, and analysis.
2. **Technical Capability Assessment**: The paper surveys 59 state-of-the-art open-source SLMs, analyzing their architectures, training datasets, and training algorithms. It also evaluates their capabilities in domains such as commonsense reasoning, in-context learning, mathematics, and coding.
3. **Runtime Cost Analysis**: To understand the on-device runtime costs of SLMs, the paper benchmarks their inference latency and memory footprints, then analyzes the benchmarking data to derive insights for future research.
4. **Future Research Directions**: Beyond summarizing the key innovations of existing SLMs, the paper proposes several potential research topics to guide future work.

### Main Contributions

- **Comprehensive Survey and Evaluation**: The paper systematically reviews SLMs released in recent years, summarizes their key innovations, and benchmarks their capabilities and on-device runtime costs.
- **Valuable Insights**: From its in-depth investigation of open-source SLMs, the paper derives insights that may benefit future SLM research, and it summarizes some potential research topics.
- **Public Results and Tools**: The paper makes all results and benchmarking tools publicly available to promote and accelerate research on SLMs.

### Key Findings

- **Typical Small Language Model Architectures**: As of August 2024, typical SLMs tend to use group-query attention (GQA), gated FFNs with SiLU activation, FFN intermediate ratios between 2 and 8, RMS normalization, and vocabularies larger than 50K. However, these settings are chosen mostly empirically, without rigorous public validation of their superiority.
- **Impact of Architectural Innovations**: Architectural innovations have a relatively significant impact on on-device runtime performance, but their effect on model capacity is limited. Apart from sharing weights between the embedding layer and the final language model head, other architectural innovations have not been widely adopted or studied.

### Training Datasets

The paper also investigates which open-source pre-training datasets are used to train SLMs and identifies 12 commonly used ones: The Pile, FineWeb-Edu, StarCoder, Cosmopedia, RefinedWeb, RedPajama, Dolma, WuDaoCorpora, RoBERTa CCNewsV2, PushShift Reddit, DCLM-baseline, and CulturaX. The Pile is the most widely used, especially in 2022 and 2023. More datasets have been proposed recently, however, making the choices more diverse.
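
To make the architectural settings listed under Key Findings more concrete, the following minimal sketch estimates how they translate into parameter count and KV-cache memory. It is an illustration under stated assumptions, not the paper's benchmarking tool: `SLMConfig`, `param_count`, and `kv_cache_bytes` are hypothetical names, the default values are invented to fall inside the reported ranges rather than taken from any surveyed model, and the estimate assumes no bias terms and 16-bit cache values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SLMConfig:
    """Hypothetical decoder-only config; values are illustrative only."""
    vocab_size: int = 64_000     # survey: vocabularies larger than 50K
    hidden_size: int = 2048
    num_layers: int = 24
    num_heads: int = 16          # query heads
    num_kv_heads: int = 4        # fewer KV heads than query heads (group-query attention)
    ffn_ratio: int = 4           # survey: FFN intermediate ratio between 2 and 8
    tie_embeddings: bool = True  # share weights between embedding and LM head


def param_count(cfg: SLMConfig) -> int:
    """Rough parameter count, assuming no biases and one weight vector per RMSNorm."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim
    intermediate = cfg.ffn_ratio * cfg.hidden_size

    attention = 2 * cfg.hidden_size * cfg.hidden_size  # query and output projections
    attention += 2 * cfg.hidden_size * kv_dim          # key and value projections
    gated_ffn = 3 * cfg.hidden_size * intermediate     # gate, up, and down projections
    norms = 2 * cfg.hidden_size                        # two RMSNorm layers per block

    embedding = cfg.vocab_size * cfg.hidden_size
    lm_head = 0 if cfg.tie_embeddings else embedding   # tied head adds no parameters
    final_norm = cfg.hidden_size
    per_layer = attention + gated_ffn + norms
    return embedding + lm_head + final_norm + cfg.num_layers * per_layer


def kv_cache_bytes(cfg: SLMConfig, seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence; 2 bytes per value assumes 16-bit storage."""
    head_dim = cfg.hidden_size // cfg.num_heads
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * cfg.num_layers * cfg.num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    base = SLMConfig()
    untied = replace(base, tie_embeddings=False)
    full_mha = replace(base, num_kv_heads=base.num_heads)
    print(f"parameters (tied):   {param_count(base):,}")
    print(f"parameters (untied): {param_count(untied):,}")
    print(f"KV cache, 1024 tokens, 4 KV heads:  {kv_cache_bytes(base, 1024) / 2**20:.0f} MiB")
    print(f"KV cache, 1024 tokens, 16 KV heads: {kv_cache_bytes(full_mha, 1024) / 2**20:.0f} MiB")
```

In this hypothetical configuration, tying the embedding and LM head removes `vocab_size * hidden_size` parameters (about 131M), and using 4 key/value heads instead of 16 shrinks the KV cache by a factor of 4. These are back-of-the-envelope estimates of weights and cache only; they do not replace the measured latency and memory footprints the paper reports.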