Chain-of-Thought Reasoning is a Policy Improvement Operator

Hugh Zhang, David C. Parkes
DOI: https://doi.org/10.48550/arXiv.2309.08589
2023-11-09
Abstract: Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
Machine Learning, Artificial Intelligence, Computation and Language
### What problems does this paper attempt to solve?

This paper explores whether large language models (LLMs) can learn new skills through self-education instead of relying solely on large amounts of human-generated training data. Specifically, it introduces SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills through chain-of-thought reasoning.

#### Main problems:

1. **Self-education ability**: Although current large language models have demonstrated surprising new capabilities, they cannot learn new skills autonomously and still depend on large amounts of existing human-generated data. The paper asks whether these models can learn new skills through self-education.
2. **Data depletion problem**: As large language models consume high-quality text data, the supply of unused data on the Internet shrinks. If language models can learn effectively from data they generate themselves, progress could be driven by compute rather than by the amount of external data available.
3. **Error avalanche problem**: During self-training, when all training data is generated by the model itself, small errors can accumulate and compound rapidly, causing a serious decline in performance. The paper proposes mitigating this problem through methods such as consistency checks.

#### Specific tasks:

The paper selects addition as a benchmark task because it is a basic mathematical task on which language models have historically performed poorly. The research objective is to verify whether a language model can teach itself to add numbers with more digits after seeing ground-truth examples only for numbers with few digits.
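The consistency checks mentioned under the error avalanche problem can be illustrated with a small sketch. The summary does not say which checks the paper uses, so the commutativity test below (keep a self-generated example only if the answers to `a + b` and `b + a` agree) is an assumption chosen for illustration. The names `filter_consistent` and `flaky_solver` are hypothetical, and `flaky_solver` is a hand-written stand-in for an imperfect model.

```python
from typing import Callable, List, Tuple

Solver = Callable[[int, int], int]


def filter_consistent(
    solver: Solver, problems: List[Tuple[int, int]]
) -> List[Tuple[int, int, int]]:
    """Keep only problems whose answer is the same for both operand orders."""
    kept = []
    for a, b in problems:
        forward = solver(a, b)
        swapped = solver(b, a)
        # Disagreement signals an error in at least one of the two answers,
        # so the example is dropped instead of being used as training data.
        if forward == swapped:
            kept.append((a, b, forward))
    return kept


def flaky_solver(a: int, b: int) -> int:
    """Hypothetical faulty model: off by one when the first operand is larger and odd."""
    if a > b and a % 2 == 1:
        return a + b + 1
    return a + b


if __name__ == "__main__":
    problems = [(3, 2), (2, 4), (5, 5)]
    print(filter_consistent(flaky_solver, problems))
    # prints [(2, 4, 6), (5, 5, 10)]
```

A check of this kind only catches errors that differ between the two orderings. A model that is wrong in the same way for both would pass, so it reduces the error rate in self-generated data but does not eliminate it.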
#### Key assumptions:

The core hypothesis of the paper is that chain-of-thought reasoning can act as a policy improvement operator, similar to how Monte-Carlo Tree Search is used in AlphaZero. In this way, the model can gradually improve its performance without external ground-truth data.

#### Experimental results:

After supervised fine-tuning, the 582M-parameter ByT5 model successfully learns to add numbers of up to 29 digits, with an accuracy of over 98%, without using any ground-truth data for numbers longer than 6 digits. The smaller 300M-parameter version of the ByT5 model shows similarly successful results.

In conclusion, this paper addresses how to enable large language models to learn new skills autonomously, and it verifies that this is possible on the addition task.
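To make the policy-improvement analogy concrete, here is a toy sketch of the self-learning loop described in the abstract. It is not the paper's implementation. `ToyModel` is a hypothetical stand-in that answers directly only up to a fixed operand length, `cot_add` is an assumed digit-peeling decomposition standing in for chain-of-thought reasoning, and the training step is abstracted away as "the next model can answer directly what the previous one could only solve with reasoning".

```python
import random
from typing import List, Tuple


class ToyModel:
    """Hypothetical stand-in for a language model with a limited direct-answer range."""

    def __init__(self, max_digits: int) -> None:
        self.max_digits = max_digits

    def fast_add(self, a: int, b: int) -> int:
        # Direct answer without reasoning; fails on operands that are too long.
        if max(len(str(a)), len(str(b))) > self.max_digits:
            raise ValueError("operands too long for a direct answer")
        return a + b


def cot_add(model: ToyModel, a: int, b: int) -> int:
    """Assumed chain-of-thought: peel off the last digits until fast_add applies."""
    if max(len(str(a)), len(str(b))) <= model.max_digits:
        return model.fast_add(a, b)
    carry, digit = divmod(a % 10 + b % 10, 10)
    high = cot_add(model, a // 10, b // 10)  # simpler subproblem, one digit shorter
    return (high + carry) * 10 + digit


def self_learning_round(
    model: ToyModel, n_problems: int = 50, seed: int = 0
) -> Tuple[ToyModel, List[Tuple[int, int, int]]]:
    """Generate harder problems, solve them with reasoning, then 'distill'."""
    rng = random.Random(seed)
    target = model.max_digits + 1
    data = []
    for _ in range(n_problems):
        a = rng.randrange(10 ** (target - 1), 10 ** target)
        b = rng.randrange(10 ** (target - 1), 10 ** target)
        data.append((a, b, cot_add(model, a, b)))  # label produced by the model itself
    # Training is abstracted away: assume the next model now answers these directly.
    return ToyModel(target), data


if __name__ == "__main__":
    model = ToyModel(max_digits=6)  # mirrors the 6-digit supervised starting point
    for _ in range(3):
        model, _ = self_learning_round(model)
    print(model.max_digits)  # prints 9
```

In this toy, reasoning plus the current model always outperforms the current model alone, which is the sense in which the reasoning step plays the role of an improvement operator. A real model makes mistakes at both stages, which is why the error avalanche problem matters in practice.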