Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei
2022-12-07
Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper aims to improve the performance of language models, and their generalization to unseen tasks, through instruction finetuning. It studies three aspects:

1. **Scaling the number of tasks**: how increasing the number of finetuning tasks affects model performance.
2. **Scaling the model size**: how models of different sizes perform after instruction finetuning, especially large models (up to 540B parameters).
3. **Finetuning on chain-of-thought (CoT) data**: how adding CoT data during finetuning affects the model's reasoning ability.

### Main findings

- **More finetuning tasks improve performance**: As the number of finetuning tasks grows, performance improves across multiple evaluation benchmarks. For example, Flan-PaLM 540B, finetuned on 1.8K tasks, outperforms PaLM 540B without instruction finetuning by 9.4% on average.
- **Larger models perform better**: With or without instruction finetuning, increasing model size improves performance across tasks.
- **CoT finetuning improves reasoning**: Finetuning on CoT data improves performance on reasoning tasks and also unlocks zero-shot reasoning. Flan-PaLM 540B reaches 75.2% accuracy on five-shot MMLU, which the paper reports as state-of-the-art.
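
The idea of phrasing tasks as instructions, with and without chain-of-thought targets, can be illustrated with a minimal Python sketch. The prompt templates, the `Example` class, and the `cot_fraction` parameter are hypothetical and chosen only for illustration; they are not the templates or mixing proportions used in the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    """One task instance; `rationale` holds chain-of-thought text if available."""
    question: str
    answer: str
    rationale: Optional[str] = None


def format_example(ex: Example, use_cot: bool) -> Tuple[str, str]:
    """Return an (input_text, target_text) pair for one finetuning example.

    The instruction wording here is illustrative, not taken from the paper.
    """
    if use_cot:
        if ex.rationale is None:
            raise ValueError("CoT formatting requires a rationale")
        prompt = ("Answer the following question by reasoning step by step.\n"
                  + ex.question)
        target = f"{ex.rationale} So the answer is {ex.answer}."
    else:
        prompt = "Answer the following question.\n" + ex.question
        target = ex.answer
    return prompt, target


def build_mixture(examples: List[Example], cot_fraction: float,
                  seed: int = 0) -> List[Tuple[str, str]]:
    """Format every example, using the CoT template for roughly
    `cot_fraction` of the examples that have a rationale."""
    rng = random.Random(seed)
    pairs = []
    for ex in examples:
        use_cot = ex.rationale is not None and rng.random() < cot_fraction
        pairs.append(format_example(ex, use_cot))
    return pairs


if __name__ == "__main__":
    data = [
        Example("What is 2 + 3?", "5", rationale="2 plus 3 equals 5."),
        Example("What is the capital of France?", "Paris"),
    ]
    for prompt, target in build_mixture(data, cot_fraction=1.0):
        print(prompt, "->", target)
```

Keeping the direct-answer and CoT templates in one function makes it easy to train on a mixture of both formats, which is the setting the summary describes as preserving performance on both kinds of evaluation.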
### Experimental results

- **Multi-task finetuning improves generalization**: Under both direct prompting and CoT prompting, the finetuned models outperform their non-finetuned counterparts on multiple benchmarks.
- **CoT data matters**: Finetuning only on non-CoT data significantly reduces performance on CoT evaluations, whereas finetuning on a mix of non-CoT and CoT data maintains or improves performance on all evaluations.
- **Zero-shot reasoning improves**: The finetuned models can generate chain-of-thought reasoning in a zero-shot setting, that is, without CoT exemplars in the prompt.

### Conclusion

Through large-scale experiments, the paper shows that instruction finetuning is effective, particularly when scaling the number of tasks, scaling the model size, and including CoT data. The results suggest that further increasing the amount of instruction-finetuning data and the model size is likely to yield additional gains in performance and generalization.