Instruction Pre-Training: Language Models are Supervised Multitask Learners

Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
2024-06-21
Abstract: Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
Computation and Language
What problem does this paper attempt to address?
The paper explores introducing supervised multitask learning into the pre-training of language models (LMs) and proposes a framework called "Instruction Pre-Training." It aims to address the following issues:

1. **Enhancing the generalization ability of language models**: Unsupervised multitask pre-training has driven the recent success of language models, but supervised multitask learning still holds significant promise. The paper aims to improve how well language models adapt and generalize across tasks by moving supervised multitask learning into the pre-training stage.
2. **Improving the effective utilization of pre-training data**: Traditional pre-training relies mainly on causal language modeling over large-scale raw corpora. This study augments those raw corpora with instruction-response pairs so that the same data is used more effectively.
3. **Exploring efficient instruction generation mechanisms**: To make this augmentation scalable, the paper develops an instruction synthesizer, built on open-source models, that generates diverse instruction-response pairs from a given raw text. According to the paper, this improves the quality and diversity of the synthesized tasks and is more efficient and cost-effective than existing methods.
4. **Validating the effectiveness of the proposed method**: The paper evaluates Instruction Pre-Training in two settings: general pre-training from scratch and continual pre-training in specific domains. In pre-training from scratch, the method consistently enhances the pre-trained base models, which also benefit more from further instruction tuning. In continual pre-training, it enables Llama3-8B to be comparable to or even outperform Llama3-70B.

In summary, the main goal of the paper is to improve the performance of language models, particularly their generalization ability and adaptability to new tasks, by introducing supervised multitask learning into pre-training. It pursues this goal by developing a new pre-training framework, Instruction Pre-Training, and combining it with an efficient instruction synthesizer.
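
To make the data augmentation step concrete, the following is a minimal sketch of how a raw text might be combined with synthesized instruction-response pairs into a single pre-training example. It is an illustration under stated assumptions, not the authors' implementation: the names `InstructionPair`, `toy_synthesizer`, and `augment_raw_text`, as well as the "Question:/Answer:" template, are hypothetical, and the rule-based `toy_synthesizer` merely stands in for the instruction synthesizer that the paper builds on open-source models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionPair:
    """One synthesized instruction-response pair (hypothetical container)."""
    instruction: str
    response: str


def toy_synthesizer(raw_text: str) -> List[InstructionPair]:
    """Rule-based stand-in for an instruction synthesizer.

    Assumption: a real synthesizer would be a language model that reads the
    raw text and generates diverse pairs; this toy version only asks about
    the first sentence so the example stays self-contained.
    """
    first_sentence = raw_text.strip().split(".")[0].strip() + "."
    return [
        InstructionPair(
            instruction="What does the first sentence of the passage state?",
            response=first_sentence,
        )
    ]


def augment_raw_text(raw_text: str, pairs: List[InstructionPair]) -> str:
    """Append instruction-response pairs to the raw text.

    The result is one plain-text training example that a causal language
    model could be trained on. The template below is an assumed format.
    """
    parts = [raw_text.strip()]
    for pair in pairs:
        parts.append(
            f"Question: {pair.instruction.strip()}\n"
            f"Answer: {pair.response.strip()}"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    text = "Water boils at 100 degrees Celsius at sea level. It freezes at 0 degrees."
    print(augment_raw_text(text, toy_synthesizer(text)))
```

The point of the sketch is only the shape of the data: each training example keeps the original raw text and adds supervised task signals derived from it, so the usual next-token objective can be applied to the augmented text.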