Distilling System 2 into System 1

Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov
2024-07-25
Abstract: Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed, such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to "compile" (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper explores how to distill the multi-step reasoning that large language models (LLMs) perform at inference time (referred to as System 2) back into the model's direct generation (referred to as System 1), so that the model improves without paying the extra inference cost of System 2. Specifically, it attempts to address the following issues:

1. **Reducing inference costs**: Many techniques that improve accuracy (such as Chain-of-Thought and other System 2 methods) produce higher quality outputs but require more compute and time, because they generate intermediate reasoning tokens. The researchers therefore look for a way to reduce the inference cost of these techniques while maintaining or improving performance.
2. **Distilling System 2 into System 1**: Using self-supervised methods, the paper distills the high-quality answers that originally required multi-step reasoning into the model's direct outputs, so that the model no longer needs to generate intermediate reasoning steps.
3. **Improving model efficiency**: The paper aims to make the model more efficient on specific tasks, especially those involving biased information, irrelevant information, and fine-grained evaluation, so that it achieves better results at lower computational cost.

To achieve these goals, the researchers applied several System 2 methods (Chain-of-Thought, System 2 Attention, Rephrase and Respond, and Branch-Solve-Merge) and measured their performance on different tasks. They then showed how these methods can be distilled back into System 1 models to improve System 1 performance. The paper also discusses which tasks can be distilled successfully and which are difficult to distill, which offers guidance for future research.
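The distillation idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `system2_sample` is a hypothetical callable that runs some System 2 pipeline on an input and returns only its final answer, and the agreement-based filter (keep an example only when enough sampled answers match) is one plausible way to select targets without labels. The names `majority_answer`, `build_distillation_set`, `num_samples`, and `min_agreement` are all invented for this sketch.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_answer(answers: List[str], min_agreement: float) -> Optional[str]:
    """Return the most common answer if its share reaches min_agreement, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None


def build_distillation_set(
    inputs: List[str],
    system2_sample: Callable[[str], str],  # hypothetical: returns the final answer only
    num_samples: int = 8,
    min_agreement: float = 0.75,
) -> List[Tuple[str, str]]:
    """Build (input, target) pairs for fine-tuning a System 1 model.

    Each target is a final answer with the intermediate reasoning removed.
    Inputs whose sampled answers disagree too much are dropped (assumed filter).
    """
    pairs = []
    for x in inputs:
        # Sample the System 2 pipeline several times on the same unlabeled input.
        answers = [system2_sample(x) for _ in range(num_samples)]
        target = majority_answer(answers, min_agreement)
        if target is not None:
            pairs.append((x, target))
    return pairs
```

The resulting pairs would then be used as ordinary supervised fine-tuning data, so that the fine-tuned model produces the answer directly. The thresholds shown are placeholders; how strict the filter should be is a design choice that depends on the task.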