Abstract: Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

What problem does this paper attempt to address?

The paper explores how to introduce supervised multitask learning into the pre-training stage of language models (LMs) and proposes a framework called "Instruction Pre-Training." It aims to address the following issues:

1. **Enhancing the generalization ability of language models**: Unsupervised multitask pre-training has driven the recent success of LMs, but supervised multitask learning still holds significant promise. The paper aims to improve how well LMs adapt and generalize across tasks through a new pre-training method.
2. **Making better use of pre-training data**: Conventional pre-training relies mainly on causal language modeling over large-scale raw corpora. Instruction Pre-Training augments those raw corpora with instruction-response pairs, so the same data is used more effectively.
3. **Generating instructions efficiently**: To make this augmentation scalable, the paper develops an instruction synthesizer, built on open-source models, that generates diverse instruction-response pairs from a given raw text. According to the paper, this improves the quality and diversity of the synthesized tasks while being more efficient and cost-effective than existing methods.
4. **Validating the effectiveness of the proposed method**: The experiments cover both general pre-training from scratch and continual pre-training in specific domains. In pre-training from scratch, the method consistently improves the pre-trained base models, which also benefit more from further instruction tuning. In continual pre-training, it enables Llama3-8B to be comparable to or even outperform Llama3-70B.

In summary, the main goal of the paper is to improve the performance of language models, particularly their generalization ability and adaptability to new tasks, by introducing supervised multitask learning during pre-training. It pursues this goal with a new pre-training framework, Instruction Pre-Training, combined with an efficient instruction synthesizer.
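The augmentation step described above, pairing each raw text with instruction-response pairs before pre-training, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `InstructionPair` type, the `Question:`/`Answer:` template, and `toy_synthesizer` (a trivial stand-in for the paper's model-based instruction synthesizer) are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionPair:
    """One synthesized task grounded in a raw text (hypothetical type)."""
    instruction: str
    response: str


def toy_synthesizer(raw_text: str) -> List[InstructionPair]:
    """Stand-in for a learned instruction synthesizer.

    A real synthesizer would be a language model that produces diverse
    tasks; this toy version emits a single fixed question so the example
    runs without any model.
    """
    first_sentence = raw_text.strip().split(".")[0].strip() + "."
    return [
        InstructionPair(
            instruction="What is the first sentence of the passage?",
            response=first_sentence,
        )
    ]


def augment_text(raw_text: str, pairs: List[InstructionPair]) -> str:
    """Join a raw text and its instruction-response pairs into one example.

    The Question/Answer template is an assumed format, chosen only to show
    the idea of an instruction-augmented pre-training example.
    """
    parts = [raw_text.strip()]
    for pair in pairs:
        parts.append(
            f"Question: {pair.instruction.strip()}\n"
            f"Answer: {pair.response.strip()}"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    raw = "Llamas are domesticated animals. They live in South America."
    print(augment_text(raw, toy_synthesizer(raw)))
```

The resulting string would then be tokenized and trained on with the usual next-token objective, like any other pre-training document. The only change in this sketch is what the document contains.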

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Multimodal Pretraining from Monolingual to Multilingual

Towards Effective and Efficient Continual Pre-training of Large Language Models

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Instruction Position Matters in Sequence Generation with Large Language Models

MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

On Domain-Specific Post-Training for Multimodal Large Language Models

Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

Efficient Continual Pre-training by Mitigating the Stability Gap

Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

In-context Pretraining: Language Modeling Beyond Document Boundaries

A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

InstructionCP: A fast approach to transfer Large Language Models into target language

Demystifying Instruction Mixing for Fine-tuning Large Language Models

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Multilingual Pre-training with Universal Dependency Learning

The Construction of Instruction-tuned LLMs for Finance without Instruction Data Using Continual Pretraining and Model Merging

Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions