Abstract: Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed, such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to "compile" (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.

What problem does this paper attempt to address?

The paper explores how to distill the deliberate, multi-step reasoning of large language models (LLMs), referred to as System 2, back into the model's direct generation without intermediate reasoning, referred to as System 1. The aim is to improve on the original System 1 performance while paying less inference cost than System 2. Specifically, the paper addresses the following issues:

1. **Reducing inference costs**: Techniques that improve accuracy through explicit reasoning (such as Chain-of-Thought and other System 2 methods) produce higher quality outputs but require more compute and time at inference. The researchers therefore look for a way to cut the inference cost of these techniques while maintaining or improving performance.
2. **Distilling System 2 into System 1**: Using self-supervised methods, the paper distills high-quality answers that originally required multi-step reasoning directly into the model's ordinary outputs, so that the model no longer needs to generate intermediate reasoning steps.
3. **Improving model efficiency**: The paper targets tasks that involve biased information, irrelevant information, and fine-grained evaluation, so that the model achieves better results on them at lower computational cost.

To achieve these goals, the researchers apply several System 2 methods (Chain-of-Thought, System 2 Attention, Rephrase and Respond, and Branch-Solve-Merge) and measure their performance on different tasks. They then show how the outputs of these methods can be distilled back into a System 1 model to improve its performance. The paper also discusses which tasks distill successfully and which are difficult to distill.
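The distillation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how System 2 outputs are filtered, so majority-vote agreement across samples is assumed here as one plausible self-supervised criterion. The names `system2_sample`, `majority_answer`, `build_distillation_set`, `n_samples`, and `min_agreement` are hypothetical.

```python
import itertools
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_answer(answers: List[str], min_agreement: float = 0.75) -> Optional[str]:
    """Return the most common answer if enough samples agree, else None.

    Assumption: agreement across samples stands in for answer quality,
    since no labels are available in a self-supervised setting.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None


def build_distillation_set(
    inputs: List[str],
    system2_sample: Callable[[str], str],  # hypothetical: runs a System 2 method, returns the final answer only
    n_samples: int = 8,
    min_agreement: float = 0.75,
) -> List[Tuple[str, str]]:
    """Build (input, final answer) pairs with the intermediate reasoning removed.

    Fine-tuning the same model on these pairs is what would teach it to
    produce the System 2 answer directly, as a System 1 response.
    """
    pairs = []
    for x in inputs:
        answers = [system2_sample(x) for _ in range(n_samples)]
        y = majority_answer(answers, min_agreement)
        if y is not None:  # drop inputs where the samples disagree
            pairs.append((x, y))
    return pairs


# Toy stand-in for a real System 2 pipeline (assumption, for illustration only).
_flip = itertools.cycle(["yes", "no"])


def toy_system2(x: str) -> str:
    return "4" if x == "2+2?" else next(_flip)


pairs = build_distillation_set(["2+2?", "ambiguous?"], toy_system2, n_samples=4)
print(pairs)  # [('2+2?', '4')]
```

In this toy run the second input is discarded because its samples split evenly, which mirrors the idea that only answers the System 2 method produces consistently are worth distilling. The actual fine-tuning call is omitted because it depends on the training framework in use.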

Distilling System 2 into System 1

Mixed Distillation Helps Smaller Language Models Reason Better

Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

Divide-or-Conquer? Which Part Should You Distill Your LLM?

Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs

Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models

SCOTT: Self-Consistent Chain-of-Thought Distillation

Implicit Chain of Thought Reasoning via Knowledge Distillation

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Keypoint-based Progressive Chain-of-Thought Distillation for LLMs

Distilling Mathematical Reasoning Capabilities into Small Language Models

LLM2: Let Large Language Models Harness System 2 Reasoning

Reinforcing Thinking through Reasoning-Enhanced Reward Models

Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language Models

Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation

Teaching Small Language Models Reasoning Through Counterfactual Distillation

Distilling Reasoning Ability from Large Language Models with Adaptive Thinking

Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought

Supervised Chain of Thought