Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace

Chiyu Song, Zhanchao Zhou, Jianhao Yan, Yuejiao Fei, Zhenzhong Lan, Yue Zhang
2024-02-22
Abstract: Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). However, the creation of instruction data is still largely heuristic, leading to significant variation in quantity and quality across existing datasets. While some research advocates for expanding the number of instructions, others suggest that a small set of well-chosen examples is adequate. To better understand data construction guidelines, our research provides a granular analysis of how data volume, parameter size, and data construction methods influence the development of each underlying ability of LLMs, such as creative writing, code generation, and logical reasoning. We present a meticulously curated dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7b to 33b parameters. Our study reveals three primary findings: (i) Despite the models' overall performance being tied to data and parameter scale, individual abilities have different sensitivities to these factors. (ii) Human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and can constantly enhance model performance with volume increases, which is unachievable with synthetic data. (iii) Instruction data brings powerful cross-ability generalization, as evidenced by out-of-domain evaluations. Furthermore, we demonstrate how these findings can guide more efficient data construction, leading to practical performance improvements on two public benchmarks.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines how data volume, model parameter scale, and data construction methods affect the development of different capabilities of large language models (LLMs) during instruction tuning. Specifically, it addresses three questions:

1. **Do data volume and model parameter scale affect all capabilities in the same way?**
   - Although overall performance tracks data volume and model parameter scale, individual capabilities differ markedly in their sensitivity to these factors. For example, some capabilities (such as creative writing) improve rapidly with a small amount of data, while others (such as ethics) need more data before they improve noticeably.
2. **How do human-curated data and synthetic data differ in effect?**
   - Human-curated data is markedly more efficient than synthetic data generated by GPT-4. Adding more synthetic data does not keep improving model performance, whereas adding more human-curated data does.
3. **Can instruction data promote cross-domain generalization?**
   - Instruction data improves performance not only on the targeted tasks but also in unseen domains. This indicates that instruction tuning helps align the knowledge acquired during pre-training to the appropriate output space.

### Main findings

1. **Data volume and model scale significantly affect overall performance, but their effect is uneven across capabilities.**
   - Even with limited resources, certain features of a capability can predict how much it will improve when data or model parameters are scaled up.
2. **Among data construction methods, synthetic data generated by GPT-4 performs poorly in instruction tuning.**
   - Data generated by methods such as Self-Instruct is inefficient, and adding more of it does not keep improving model performance.
3. **Instruction data promotes strong cross-domain generalization.**
   - In unseen domains, different capabilities grow at different paces, and human-curated data improves model performance more than synthetic data.

### Experimental design

To study these questions systematically, the authors used the LLaMA series of models and introduced a new Chinese dataset, DoIT, which contains more than 40,000 human-curated instruction instances covering ten LLM capability categories. The experiments include:

- **Data volume experiment**: Uniformly sample different amounts of data in each capability category, increasing from 1 sample to 1,000 samples.
- **Model parameter scale experiment**: Train models of different parameter scales (7 billion, 13 billion, and 33 billion parameters).
- **Data construction method experiment**: Compare the effects of human-curated data and synthetic data.

### Conclusion

Based on a detailed per-capability analysis, the authors propose guidelines for more efficient data construction and show that these lead to practical performance improvements on two public benchmarks (CMMLU and AGIEval). The findings help explain the dynamics of instruction tuning and offer practical guidance for future data construction.
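
The data volume experiment under Experimental design draws the same number of examples from every capability category and then grows that number. The following is a minimal Python sketch of that sampling step, not the authors' code. The record layout (an `"ability"` key), the function name `sample_per_ability`, the toy data, and the intermediate sizes in `VOLUME_GRID` are all assumptions made for illustration; the summary only states that the per-category size ranges from 1 to 1,000.

```python
import random
from collections import defaultdict


def sample_per_ability(examples, n_per_ability, seed=0):
    """Draw up to n_per_ability examples uniformly at random from each ability.

    `examples` is a list of dicts that each carry an "ability" key
    (a hypothetical layout, not the actual DoIT schema).
    """
    rng = random.Random(seed)

    # Group the examples by their ability label.
    by_ability = defaultdict(list)
    for ex in examples:
        by_ability[ex["ability"]].append(ex)

    subset = []
    for ability in sorted(by_ability):
        pool = by_ability[ability]
        # Cap at the pool size so small categories do not raise an error.
        k = min(n_per_ability, len(pool))
        subset.extend(rng.sample(pool, k))  # sampling without replacement
    return subset


# Toy stand-in for an instruction dataset: 3 abilities with 50 examples each.
toy = [
    {"ability": ability, "instruction": f"{ability}-{i}"}
    for ability in ("code", "ethics", "writing")
    for i in range(50)
]

# Hypothetical grid of per-ability sizes; only the 1 to 1,000 range is from the text.
VOLUME_GRID = [1, 4, 16]

for n in VOLUME_GRID:
    subset = sample_per_ability(toy, n)
    print(f"{n} per ability -> {len(subset)} training examples")
```

Each subset would then be used to fine-tune one model, so that per-capability scores can be compared across data volumes. Fixing the seed keeps the subsets reproducible between runs.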