TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu
2024-06-04
Abstract: We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper explores how far a small language model such as TinyLlama can be pushed by training it on a very large dataset, at a time when most research focuses on training ever-larger models. Specifically, the paper focuses on the following points:

1. **Combining small models with large data**: Large language models perform well on many tasks, but pre-training a small model on far more data than is usual for its size has not been fully explored. The paper tests this combination by training TinyLlama, a Transformer decoder model with 1.1 billion parameters, on around 1 trillion tokens for approximately 3 epochs, so roughly 3 trillion tokens are processed in total.
2. **Multi-stage pre-training strategy**: The paper proposes a multi-stage pre-training method consisting of basic pre-training, continual pre-training on domain-specific data, and a cooldown stage. This method aims to improve the model's performance across tasks, especially domain-specific ones.
3. **Optimizing training efficiency**: To train a small model efficiently on a large-scale dataset, the paper employs optimization techniques such as Fully Sharded Data Parallel (FSDP) and FlashAttention, which improve training speed and computational efficiency.
4. **Model performance evaluation**: The paper evaluates TinyLlama on tasks such as commonsense reasoning and problem-solving and compares it with existing open-source language models of comparable size (such as OPT-1.3B and Pythia-1.0B), reporting better performance on multiple tasks.

In summary, through the TinyLlama project, this paper explores the feasibility and advantages of training small language models on large-scale datasets, providing new ideas and methods for future research.
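
The relationship between the figures quoted in the abstract (1.1B parameters, around 1 trillion tokens, approximately 3 epochs) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `token_budget` is hypothetical, and the inputs are the rounded values from the abstract rather than exact counts from the training run.

```python
def token_budget(params: int, unique_tokens: int, epochs: int) -> tuple[int, float]:
    """Return (total tokens processed, tokens processed per parameter)."""
    total_tokens = unique_tokens * epochs
    return total_tokens, total_tokens / params


# Rounded figures taken from the abstract; treat them as approximations.
PARAMS = 1_100_000_000             # 1.1B parameters
UNIQUE_TOKENS = 1_000_000_000_000  # around 1 trillion tokens in the corpus
EPOCHS = 3                         # approximately 3 passes over the corpus

total_tokens, tokens_per_param = token_budget(PARAMS, UNIQUE_TOKENS, EPOCHS)
print(f"total tokens processed: {total_tokens:.2e}")
print(f"tokens per parameter:   {tokens_per_param:.0f}")
```

Under these assumptions the model processes about 3 trillion tokens in total, or a few thousand tokens per parameter, which is the "small model, large data" regime the paper sets out to study.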