A Teacher Is Worth A Million Instructions

Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera
2024-06-27
Abstract: Large Language Models (LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and on finding the best instruction tuning set. Further, the inherent limitations of training methods create substantial difficulties in training relatively smaller models with 7B and 13B parameters. In our research, we suggest an improved training method for these models that utilises knowledge from larger models, such as mixture-of-experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model's ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to $7.9$ on MT-Bench and $93.04\%$ on AlpacaEval.
Machine Learning
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

1. **Challenges in Training Large Language Models (LLMs)**: The paper points out that although large language models perform excellently, training them is difficult because of data quality, the selection of instruction tuning sets, and inherent limitations of training methods. These issues are particularly significant for smaller models (such as those with 7B and 13B parameters).
2. **Application of Knowledge Distillation in Model Training**: The researchers propose an improved training method that uses the knowledge of larger models to guide the training of smaller ones. Specifically, they use a mixture-of-experts architecture (such as the 8x7B architecture) as the teacher model and transfer knowledge to the student model through knowledge distillation (KD).
3. **Domain Alignment from Expert (DAE) Algorithm**: The paper proposes a novel post-training domain alignment algorithm that uses domain-specific expert models to strengthen the model's knowledge in those domains while maintaining its ability to generalise. This method not only improves the model's performance on public benchmarks but also achieves significant results in the e-commerce domain.

Through these methods, the researchers improved the performance of smaller models and surpassed models with more parameters on multiple tasks. The experimental results demonstrate the effectiveness of knowledge distillation and the domain alignment algorithm, offering new ideas and methods for training large language models.
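
To make the teacher-student idea concrete, the following is a minimal sketch of a generic knowledge-distillation loss for a single token position. It is the common temperature-softened formulation (a KL term between teacher and student distributions, blended with cross-entropy on the reference token), not necessarily the exact objective used in the paper. The function names, the `temperature` and `alpha` parameters, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KD term.

    alpha weights the cross-entropy on the reference token;
    (1 - alpha) weights the KL term between softened distributions.
    Both hyperparameters are illustrative assumptions.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)

    hard_student = softmax(student_logits)
    ce_term = -math.log(hard_student[target_index] + 1e-12)

    return alpha * ce_term + (1 - alpha) * kd_term


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits over 3 tokens
    student = [0.1, 1.0, 2.0]   # hypothetical student logits, poorly aligned
    print(distillation_loss(student, teacher, target_index=0))
```

In this sketch the KD term vanishes when the student reproduces the teacher's logits and grows as the two distributions diverge, which is the signal that lets a larger teacher guide a smaller student beyond what the hard labels alone provide.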