Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo
DOI: https://doi.org/10.48550/arXiv.2310.13961
2023-10-21
Abstract: Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B-40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) Categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) Ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful outputs than their larger un-tuned counterparts. Our codebase is available at https://github.com/IBM/ensemble-instruct.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to use smaller language models (10B-40B parameters) to generate high-quality instruction-tuning data, replacing methods that rely on very large language models (such as a 175B-parameter model). Specifically, the authors explore how to achieve this with smaller, publicly available language models through ensemble learning and improved in-context learning (ICL) techniques.

### Main contributions of the paper:

1. **Propose a new algorithm**: Ensemble-Instruct, which can generate high-quality instruction-tuning data with smaller language models.
2. **Improve the Self-Instruct method**: Categorizing and simplifying the ICL templates and ensembling the outputs of multiple language models improves the quality of the generated data.
3. **Verify the effectiveness of the method**: Experimental results show that data generated with Ensemble-Instruct significantly improves the performance of small language models, and can even outperform data generated with large language models.
4. **Release a synthetic dataset**: Approximately 45,000 synthetic samples, together with the corresponding ICL templates and code, are provided for the research community.

### Solutions to specific problems:

1. **Task classification and simplification**: Tasks are divided into those that require an input (Type A) and those that do not (Type B), with a dedicated generation pipeline and simplified prompts for each type.
2. **Ensemble learning**: High-quality synthetic samples are selected by ensembling the outputs of multiple language models. Specifically:
   - Examples generated by different language models are included in the final set to increase diversity.
   - Accuracy is improved through majority voting and low-consensus filtering.

### Experimental verification:

- Multiple language models (such as T5, UL2, FALCON, etc.) are used for instruction generation and instance generation.
- The effect of tuning on the different generated datasets is evaluated on base models such as MPT-7B and GPT-JT-6B.
- The results show that the data generated by Ensemble-Instruct is not only of higher quality but also achieves better performance with smaller datasets.

### Summary:

Ensemble-Instruct aims to remove the dependence of existing methods on very large language models and provides a more efficient, open-source alternative suited to researchers and developers with limited resources.
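
To make the majority-voting and low-consensus filtering idea mentioned above more concrete, here is a minimal Python sketch. The summary does not specify the similarity metric, the threshold, or how the agreed output is chosen, so the token-overlap (Jaccard) score, the 0.5 threshold, and the function names `token_overlap` and `consensus_output` are all assumptions made for illustration. This is not the authors' implementation.

```python
from itertools import combinations
from typing import List, Optional


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens.

    Assumption: a simple stand-in for whatever agreement metric is actually used.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_output(candidates: List[str], threshold: float = 0.5) -> Optional[str]:
    """Pick an output from the best-agreeing pair of LM outputs.

    Returns None when even the best pair agrees less than `threshold`
    (hypothetical value), i.e. the example is filtered out as low-consensus.
    """
    if len(candidates) < 2:
        return None
    # Find the pair of candidate outputs with the highest agreement.
    best_pair = max(combinations(candidates, 2), key=lambda p: token_overlap(*p))
    if token_overlap(*best_pair) < threshold:
        return None  # low consensus: drop this synthetic example
    return best_pair[0]


# Hypothetical outputs from three different LMs for the same instruction.
outputs = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Berlin",
]
kept = consensus_output(outputs)
dropped = consensus_output(["yes", "no", "maybe"])
```

In this sketch, `kept` is the first output of the two that agree with each other, while `dropped` is `None` because no pair of outputs shares any tokens. A real pipeline would likely use a stronger text-similarity metric and a tuned threshold.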