Abstract: Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) Categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) Ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful outputs than their larger un-tuned counterparts. Our codebase is available at https://github.com/IBM/ensemble-instruct.

What problem does this paper attempt to address?

This paper asks how smaller language models (roughly 10B-40B parameters) can be used to generate high-quality instruction-tuning data, replacing methods that rely on very large language models (around 175B parameters). Specifically, the authors explore how to achieve this with smaller, publicly available language models through ensembling and improved in-context learning (ICL) techniques.

### Main contributions of the paper:

1. **Propose a new algorithm**: Ensemble-Instruct, which generates high-quality instruction-tuning data with smaller language models.
2. **Improve the Self-Instruct method**: Categorizing and simplifying the ICL templates and ensembling the outputs of multiple language models improves the quality of the generated data.
3. **Verify the effectiveness of the method**: Experimental results show that data generated with Ensemble-Instruct significantly improves the performance of small language models, and even exceeds the effect of data generated with large language models.
4. **Release a synthetic dataset**: Approximately 45,000 synthetic samples, together with the corresponding ICL templates and code, are provided for the research community.

### Solutions to specific problems:

1. **Task classification and simplification**: Tasks are divided into those that require an input (Type A) and those that do not (Type B), with a dedicated generation pipeline and simplified prompts for each type.
2. **Ensemble learning**: High-quality synthetic samples are selected by combining the outputs of multiple language models. Specifically:
   - Examples generated by different language models are included in the final set to increase diversity.
   - Accuracy is improved through majority voting and filtering of low-consensus examples.

### Experimental verification:

- Multiple language models (such as T5, UL2, FALCON, etc.) are used for instruction generation and instance generation.
- The effect of tuning on the different generated datasets is evaluated with base models such as MPT-7B and GPT-JT-6B.
- The results show that the data generated by Ensemble-Instruct is not only of higher quality but also yields better performance with smaller datasets.

### Summary:

Ensemble-Instruct addresses the reliance of existing methods on very large, proprietary language models and offers a more efficient, open-source alternative suited to researchers and developers with limited resources.
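
The ensembling step summarized above (majority voting with low-consensus filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the LCS-based similarity score, and the `threshold` value of 0.5 are all assumptions chosen for illustration. Given the outputs that several LMs produce for the same instruction, it keeps the output that agrees most with the others and discards the example when agreement is low.

```python
from typing import Optional


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(x: str, y: str) -> float:
    """LCS-based F1 over whitespace tokens (a ROUGE-L-style score; an assumption)."""
    a, b = x.lower().split(), y.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def consensus_output(outputs: list[str], threshold: float = 0.5) -> Optional[str]:
    """Return the output that agrees most with the others, or None if consensus is low.

    `threshold` is a hypothetical cutoff, not a value taken from the paper.
    """
    if len(outputs) < 2:
        return None
    # Score each candidate by its average similarity to every other candidate.
    scores = []
    for i, cand in enumerate(outputs):
        others = [similarity(cand, o) for j, o in enumerate(outputs) if j != i]
        scores.append(sum(others) / len(others))
    best = max(range(len(outputs)), key=scores.__getitem__)
    # Low-consensus filtering: drop the example when even the best candidate disagrees.
    return outputs[best] if scores[best] >= threshold else None


if __name__ == "__main__":
    candidates = [
        "Paris is the capital of France",
        "The capital of France is Paris",
        "Paris is the capital of France",
    ]
    print(consensus_output(candidates))
    print(consensus_output(["blue", "seven apples", "run quickly home"]))
```

A soft similarity score is used here instead of exact-match voting because free-form LM outputs rarely match word for word; exact-match voting would be a reasonable alternative for classification-style tasks with short labels.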

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

AgentTuning: Enabling Generalized Agent Abilities for LLMs

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

Maybe Only 0.5 Training Data Instruction Tuning

Demystifying Instruction Mixing for Fine-tuning Large Language Models

SelectLLM: Can LLMs Select Important Instructions to Annotate?

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Towards Robust Instruction Tuning on Multimodal Large Language Models

Instruction Tuning With Loss Over Instructions

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

InstructEval: Systematic Evaluation of Instruction Selection Methods

LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Smaller Language Models Are Better Instruction Evolvers