Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

What problem does this paper attempt to address?

The paper asks how instruction finetuning can improve the performance of language models and their generalization to unseen tasks. It examines three aspects:

1. **Scaling the number of tasks**: how increasing the number of finetuning tasks affects model performance.
2. **Scaling the model size**: how models of different sizes perform after instruction finetuning, especially large models (up to 540B parameters).
3. **Finetuning on chain-of-thought (CoT) data**: how adding CoT data during finetuning affects the model's reasoning ability.

### Main findings

- **More tasks improve performance**: As the number of finetuning tasks increases, performance improves on multiple evaluation benchmarks. For example, Flan-PaLM 540B, finetuned on 1.8K tasks, outperforms the unfinetuned PaLM 540B by 9.4% on average.
- **Larger models perform better**: For both unfinetuned and finetuned models, increasing the model size improves performance across tasks.
- **CoT finetuning enhances reasoning**: Finetuning with CoT data improves performance on reasoning tasks and also unlocks zero-shot reasoning. For example, Flan-PaLM 540B reaches 75.2% accuracy on five-shot MMLU, which the paper reports as state of the art.
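
To make the prompting setups named in the abstract (zero-shot, few-shot, CoT) concrete, here is a minimal sketch of how one question could be rendered in each setup. The `Example` class, the `Q:`/`A:` template, the "So the answer is" suffix, and the "Let's think step by step." trigger are illustrative assumptions, not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """One solved exemplar (hypothetical container, not from the paper)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # chain-of-thought text, if available


def format_target(ex: Example, use_cot: bool) -> str:
    # With CoT, the exemplar shows its reasoning before the final answer.
    if use_cot and ex.rationale:
        return f"{ex.rationale} So the answer is {ex.answer}."
    return ex.answer


def build_prompt(instruction: str, query: str,
                 exemplars: Sequence[Example] = (),
                 use_cot: bool = False) -> str:
    """Render a prompt; zero-shot if `exemplars` is empty, few-shot otherwise."""
    blocks = [instruction]
    for ex in exemplars:
        blocks.append(f"Q: {ex.question}\nA: {format_target(ex, use_cot)}")
    final = f"Q: {query}\nA:"
    if use_cot and not exemplars:
        # Assumed zero-shot CoT trigger phrase.
        final += " Let's think step by step."
    blocks.append(final)
    return "\n\n".join(blocks)


demo = Example("What is 2 + 3?", "5", rationale="2 plus 3 equals 5.")
zero_shot = build_prompt("Answer the question.", "What is 4 + 4?")
few_shot_cot = build_prompt("Answer the question.", "What is 4 + 4?",
                            exemplars=[demo], use_cot=True)
print(few_shot_cot)
```

The same function covers all four combinations (with or without exemplars, with or without CoT), so the only thing that changes between setups is the arguments passed in.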
### Experimental results

- **Multi-task finetuning improves generalization**: Under both direct prompting and chain-of-thought (CoT) prompting, the finetuned models perform well on multiple benchmarks.
- **CoT data matters**: Finetuning on non-CoT data alone significantly reduces performance on CoT tasks, whereas finetuning on a combination of non-CoT and CoT data maintains or improves performance on all tasks.
- **Zero-shot reasoning improves**: The finetuned model can generate CoT reasoning in a zero-shot setting, that is, it can reason step by step without exemplars.

### Conclusion

Through large-scale experiments, the paper shows that instruction finetuning is effective, particularly when the number of tasks, the model size, and the CoT data are scaled together. The findings suggest that continuing to scale instruction finetuning data and model size will further improve the performance and generalization of language models.
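
The finding about combining CoT and non-CoT data can be illustrated with a small sampling sketch. It draws finetuning examples from named sources in proportion to a weight per source. The function name `sample_mixture`, the source names, and the weights are hypothetical; the paper's actual mixture proportions are not given in this summary.

```python
import random
from typing import Dict, List, Tuple


def sample_mixture(sources: Dict[str, List[str]],
                   weights: Dict[str, float],
                   n: int,
                   seed: int = 0) -> List[Tuple[str, str]]:
    """Draw n (source_name, example) pairs, picking sources by weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = sorted(sources)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    # Within a source, examples are drawn uniformly with replacement.
    return [(name, rng.choice(sources[name])) for name in picks]


# Toy data standing in for real finetuning examples (assumption).
sources = {
    "cot": ["cot example 1", "cot example 2"],
    "non_cot": ["direct example 1", "direct example 2", "direct example 3"],
}
mixed = sample_mixture(sources, {"cot": 0.5, "non_cot": 0.5}, n=20)
non_cot_only = sample_mixture(sources, {"cot": 0.0, "non_cot": 1.0}, n=50)
```

Setting the `cot` weight to zero reproduces the "non-CoT only" condition described above, while a nonzero weight yields the combined mixture.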

Scaling Instruction-Finetuned Language Models
