A Teacher Is Worth A Million Instructions

Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera

2024-06-27

Abstract: Large Language Models (LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and on finding the best instruction tuning set. Further, the inherent limitations of training methods make it substantially difficult to train relatively smaller models with 7B and 13B parameters. In our research, we suggest an improved training method for these models that utilises knowledge from larger models, such as mixture-of-experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model's ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to $7.9$ on MT-Bench and $93.04\%$ on AlpacaEval.

Machine Learning

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Challenges in Training Large Language Models (LLMs)**: The paper points out that although large language models perform excellently, there are many challenges in training these models, including data quality, the selection of instruction tuning sets, and inherent limitations of training methods. These issues are particularly significant for models with smaller parameter scales (such as models with 7B and 13B parameters).
2. **Application of Knowledge Distillation in Model Training**: The researchers propose an improved training method that uses the knowledge of larger-scale models to guide the training of smaller-scale models. Specifically, they use a mixture of experts architecture (such as the 8x7B architecture) as the teacher model and transfer knowledge to the student model through knowledge distillation (KD).
3. **Domain Alignment from Expert (DAE) Algorithm**: The paper proposes a novel post-training domain alignment algorithm that combines expert models from specific domains to enhance the model's knowledge in specific areas while maintaining its generalization ability. This method not only improves the model's performance on public benchmarks but also achieves significant results in the e-commerce domain.

Through the above methods, the researchers successfully improved the performance of smaller models and surpassed larger parameter models in multiple tasks. Experimental results demonstrate the effectiveness of knowledge distillation and the domain alignment algorithm, providing new ideas and methods for training large language models.
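The knowledge distillation step in this summary can be illustrated with a small sketch. The summary does not state the exact loss the authors use, so the code below assumes the classic soft-target formulation: a temperature-softened KL term between teacher and student distributions, blended with ordinary cross-entropy on the ground-truth token. The function names, `temperature`, and `alpha` are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (teacher) with a hard-target term (label).

    Assumed formulation, not the paper's: alpha weights the teacher signal,
    and the soft term is scaled by temperature**2, as is conventional.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = (temperature ** 2) * kl_divergence(teacher_soft, student_soft)

    # Cross-entropy of the student against the ground-truth token.
    student_probs = softmax(student_logits)
    hard_term = -math.log(student_probs[target_index] + 1e-12)

    return alpha * soft_term + (1.0 - alpha) * hard_term


# Hypothetical next-token logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
near_student = [3.5, 1.0, 0.5]
far_student = [0.5, 1.0, 4.0]

loss_near = distillation_loss(near_student, teacher, target_index=0)
loss_far = distillation_loss(far_student, teacher, target_index=0)
```

A student whose logits sit close to the teacher's receives a lower loss than one that disagrees with it, which is the signal that pulls the smaller model toward the larger one. In a real training run the same quantity would be computed per token over the full vocabulary with a tensor library rather than Python lists.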

TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

Pedagogical Alignment of Large Language Models

Maybe Only 0.5 Training Data Instruction Tuning

Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

CITING: Large Language Models Create Curriculum for Instruction Tuning

AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models

Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

Towards Modeling Learner Performance with Large Language Models

Large Language Models are In-context Teachers for Knowledge Reasoning

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages