Abstract: We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

What problem does this paper attempt to address?

The paper explores how far a small language model can be pushed when it is trained on far more data than is typical for its size, at a time when most research concentrates on large language models. Specifically, it focuses on the following points:

1. **Combining small models with large data**: Existing research shows that large language models perform well on a wide range of tasks, but pre-training small models on very large amounts of data has not been fully explored. The paper tests this combination by training TinyLlama, a Transformer decoder model with 1.1 billion parameters, on around 1 trillion tokens for approximately 3 epochs (roughly 3 trillion tokens processed in total).
2. **Multi-stage pre-training strategy**: The paper proposes a multi-stage pre-training method consisting of basic pre-training, domain-specific continual pre-training, and a cooldown stage. The method aims to improve performance across tasks, especially domain-specific ones.
3. **Optimizing training efficiency**: To train a small model efficiently on a large-scale dataset, the paper employs optimization techniques such as Fully Sharded Data Parallel (FSDP) and FlashAttention, which improve training speed and efficiency.
4. **Model performance evaluation**: The paper evaluates TinyLlama on tasks such as commonsense reasoning and problem-solving and compares it with existing open-source language models of comparable size (such as OPT-1.3B and Pythia-1.0B), reporting better performance on multiple tasks.

In summary, through the TinyLlama project, the paper explores the feasibility and advantages of training small language models on large-scale datasets, providing ideas and methods for future research.
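To make the scale of "small model, large data" concrete, the arithmetic behind the figures quoted in the abstract (1.1B parameters, around 1 trillion tokens, approximately 3 epochs) can be sketched as below. This is only a back-of-the-envelope illustration: the function names are hypothetical, all inputs are approximate, and the comparison point of roughly 20 training tokens per parameter is a commonly cited rule of thumb that is assumed here rather than taken from this page.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# All values are approximate; the helper names are illustrative only.

PARAMS = 1.1e9           # model size: about 1.1B parameters
UNIQUE_TOKENS = 1.0e12   # corpus size: around 1 trillion tokens
EPOCHS = 3               # approximately 3 passes over the corpus

# Assumption (not stated on this page): a rule of thumb of roughly
# 20 training tokens per parameter for compute-optimal training.
RULE_OF_THUMB_TOKENS_PER_PARAM = 20


def tokens_seen(unique_tokens: float, epochs: float) -> float:
    """Total tokens processed during training, counting repeated epochs."""
    return unique_tokens * epochs


def tokens_per_param(total_tokens: float, params: float) -> float:
    """Training tokens processed per model parameter."""
    return total_tokens / params


total = tokens_seen(UNIQUE_TOKENS, EPOCHS)
ratio = tokens_per_param(total, PARAMS)
multiple = ratio / RULE_OF_THUMB_TOKENS_PER_PARAM

if __name__ == "__main__":
    print(f"tokens processed:     {total:.2e}")
    print(f"tokens per parameter: {ratio:,.0f}")
    print(f"multiple of the assumed rule of thumb: {multiple:.0f}x")
```

Under these assumptions the model processes on the order of a few thousand tokens per parameter, which is why the summary can speak of about 3 trillion tokens even though the corpus itself holds about 1 trillion.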

TinyLlama: An Open-Source Small Language Model
