Abstract: As large language models (LLMs) have demonstrated impressive step-by-step reasoning capabilities across recent natural language processing (NLP) reasoning tasks, many studies have sought to distill these reasoning abilities into smaller language models (SLMs) via fine-tuning. Previous distillation methods usually rely on LLMs to generate chain-of-thought (CoT) samples that are then used to teach SLMs. However, this approach performs poorly in certain scenarios due to the limitations of CoT. In this work, we introduce a novel Mixed Distillation (MD) framework that distills multiple step-by-step reasoning abilities into SLMs. First, we leverage LLMs to automatically sample multiple step-by-step reasoning rationales. Then, we create high-quality, well-balanced mixed thought data and design a novel multi-task loss to help SLMs better learn and adaptively activate multiple forms of step-by-step reasoning. Our extensive experiments demonstrate that MD enhances both the single-path (using either CoT or program-of-thought, PoT) and multi-path (using both CoT and PoT) reasoning abilities of SLMs at inference time across reasoning tasks. Notably, a single model trained with MD exceeds the overall performance of an ensemble of two separately distilled CoT and PoT models. Mistral-7B using MD can achieve remarkable improvements of 87.5%, 74.0%, and 77.1% on SVAMP, GSM8K, and ASDIV, respectively, outperforming the teacher model, GPT-3.5-Turbo. We hope our work provides insight into SLMs' multiple step-by-step reasoning abilities.
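
The abstract does not specify how multi-path inference combines CoT and PoT outputs, so the following is only a minimal sketch under stated assumptions: CoT samples are assumed to already be reduced to final numeric answers, each PoT rationale is assumed to be Python source that stores its result in a variable named `ans`, and the candidates are merged by simple majority vote. The names `run_pot` and `multi_path_answer` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def run_pot(program: str):
    """Run one PoT rationale and return its `ans` value, or None on failure.

    Assumption: the program is Python source that assigns its result to `ans`.
    Executing model-generated code is unsafe outside a sandbox.
    """
    scope = {}
    try:
        exec(program, scope)
    except Exception:
        return None
    return scope.get("ans")


def multi_path_answer(cot_answers, pot_programs):
    """Majority vote over CoT final answers and executed PoT results.

    Assumption: voting is one plausible way to combine the two paths;
    the abstract does not state the aggregation rule actually used.
    """
    candidates = []
    for answer in cot_answers:
        try:
            candidates.append(float(answer))
        except (TypeError, ValueError):
            continue  # skip CoT answers that are not numeric
    for program in pot_programs:
        result = run_pot(program)
        try:
            candidates.append(float(result))
        except (TypeError, ValueError):
            continue  # skip failed or non-numeric programs
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Two CoT samples disagree; the one runnable PoT program breaks the tie.
print(multi_path_answer(["8", "7"], ["ans = 3 + 5", "ans = 1 / 0"]))
```

Here the failing program (`1 / 0`) is discarded rather than counted, so an execution error in one path does not block an answer from the other.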

Effective Distillation of Table-based Reasoning Ability from LLMs

Mixed Distillation Helps Smaller Language Model Better Reasoning

Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation

Mixed Distillation Helps Smaller Language Models Reason Better

Distilling Mathematical Reasoning Capabilities into Small Language Models

Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation

Distilling LLMs' Decomposition Abilities into Compact Language Models

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks

Teaching Small Language Models Reasoning Through Counterfactual Distillation

Divide-or-Conquer? Which Part Should You Distill Your LLM?

Democratizing Reasoning Ability: Tailored Learning from Large Language Model

Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data

TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

Rethinking Tabular Data Understanding with Large Language Models

PaD: Program-aided Distillation Can Teach Small Models Reasoning Better Than Chain-of-thought Fine-tuning

Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA

Large Language Models are few(1)-shot Table Reasoners

LLAVADI: What Matters For Multimodal Large Language Models Distillation