Abstract: Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
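The self-learning loop described in the abstract can be illustrated with a deliberately simplified toy. This is a sketch under strong assumptions, not the authors' implementation: `ToyModel`, `add_with_cot`, and `self_learning_loop` are hypothetical names, the "model" is a stand-in that can add directly only up to a fixed number of digits, and "training" is a placeholder that widens that limit instead of fine-tuning a network.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model.

    It answers addition problems directly only when both operands
    have at most `max_digits` digits.
    """

    def __init__(self, max_digits):
        self.max_digits = max_digits

    def add_direct(self, a, b):
        if max(len(str(a)), len(str(b))) > self.max_digits:
            return None  # beyond what this model can do without reasoning
        return a + b


def add_with_cot(model, a, b):
    """Toy chain of thought: peel off the lowest digit of each operand,
    solve the shorter problem directly, then recombine with the carry."""
    a_hi, a_lo = divmod(a, 10)
    b_hi, b_lo = divmod(b, 10)
    hi = model.add_direct(a_hi, b_hi)
    if hi is None:
        return None
    carry, lo = divmod(a_lo + b_lo, 10)
    return (hi + carry) * 10 + lo


def self_learning_loop(start_digits, rounds, samples=100, seed=0):
    """Alternate between reasoning on harder problems and 'distilling'."""
    rng = random.Random(seed)
    model = ToyModel(start_digits)
    for _ in range(rounds):
        n = model.max_digits + 1  # one digit harder than the current skill
        data = []
        for _ in range(samples):
            a = rng.randrange(10 ** (n - 1), 10 ** n)
            b = rng.randrange(10 ** (n - 1), 10 ** n)
            answer = add_with_cot(model, a, b)
            if answer is not None:
                data.append((a, b, answer))
        # Placeholder for training: a real system would fine-tune the next
        # model on `data` so it answers n-digit problems without reasoning.
        if len(data) == samples:
            model = ToyModel(n)
    return model


if __name__ == "__main__":
    final = self_learning_loop(start_digits=6, rounds=3)
    print(final.max_digits)
```

The point of the toy is only the shape of the loop: reasoning lets the current model solve problems one step beyond its direct ability, and the resulting answers become training targets for the next model.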

What problem does this paper attempt to address?

This paper explores whether large language models (LLMs) can learn new skills through self-education instead of relying solely on large amounts of human-generated training data. Specifically, it introduces SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills through chain-of-thought reasoning.

#### Main problems:

1. **Self-education ability**: Although current large language models have demonstrated surprising new capabilities, they cannot learn new skills autonomously and still rely on large amounts of existing human-generated data. The paper asks whether these models can learn new skills through self-education.
2. **Data depletion problem**: As large language models consume high-quality text data, the supply of unused data on the Internet shrinks. If language models can learn effectively from their own generated data, progress could be driven by compute alone rather than by the amount of external data.
3. **Error avalanche problem**: During self-training, when all training data is generated by the model itself, small errors can accumulate and compound rapidly, causing a serious decline in performance. The paper proposes mitigating this through methods such as consistency checks.

#### Specific tasks:

The paper selects addition as a benchmark task because it is a basic mathematical task on which language models have historically performed poorly. The research objective is to verify whether a language model that has only seen additions of numbers with few digits can learn, through self-education, to add numbers with more digits.

#### Key assumptions:

The core assumption of the paper is that chain-of-thought reasoning can act as a policy improvement operator, similar to how Monte-Carlo Tree Search is used in AlphaZero. In this way, the model can gradually improve its performance without external ground-truth data.

#### Experimental results:

The experimental results show that, after supervised fine-tuning, the 582M-parameter ByT5 model successfully learns to add numbers of up to 29 digits with an accuracy of over 98%, without using any ground-truth data beyond 6 digits. The smaller 300M-parameter version of the ByT5 model shows similarly successful results.

In conclusion, this paper addresses how to enable large language models to learn new skills autonomously, and it verifies that this is possible on the addition task.
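The summary mentions consistency checks as a way to limit the error avalanche but does not spell out their form. The sketch below shows one plausible form only, as an assumption: keep a self-generated example only if repeated samples agree with each other and the answer is the same when the operands are swapped. The names `majority_answer` and `filter_self_generated`, the `solve` callback, and the thresholds are all hypothetical.

```python
from collections import Counter


def majority_answer(candidates, min_agreement=0.8):
    """Return the most common answer if enough samples agree, else None."""
    if not candidates:
        return None
    answer, count = Counter(candidates).most_common(1)[0]
    return answer if count / len(candidates) >= min_agreement else None


def filter_self_generated(problems, solve, n_samples=5, min_agreement=0.8):
    """Keep only (a, b, answer) triples that pass both checks.

    `solve(a, b)` is a hypothetical callback standing in for sampling one
    answer from the model; it may be noisy or wrong.
    """
    kept = []
    for a, b in problems:
        forward = majority_answer(
            [solve(a, b) for _ in range(n_samples)], min_agreement
        )
        backward = majority_answer(
            [solve(b, a) for _ in range(n_samples)], min_agreement
        )
        # Discard the example unless a + b and b + a agree on one answer.
        if forward is not None and forward == backward:
            kept.append((a, b, forward))
    return kept
```

Filtering of this kind trades training data volume for label quality: examples the model is unsure about are dropped rather than risk feeding wrong answers into the next round.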

Chain-of-Thought Reasoning is a Policy Improvement Operator
