Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
2023-01-11
Abstract: We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to improve the performance of large language models on complex reasoning tasks, and answers with a method called "chain-of-thought prompting." The method places a few worked exemplars in the prompt, each showing the series of intermediate reasoning steps that lead to the answer, so that the model produces similar steps before answering a new question. No additional training or fine-tuning is required. The researchers show that this improves the performance of large language models on arithmetic, commonsense, and symbolic reasoning tasks. For the PaLM model with 540 billion parameters, prompting with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing the previous best results, including fine-tuned GPT-3 with a verifier. The method is not limited to arithmetic: it is also effective on commonsense reasoning tasks (such as the CSQA dataset) and symbolic reasoning tasks (e.g., last-letter concatenation and coin-flip problems). The experiments further indicate that the benefit of chain-of-thought prompting grows with model size, particularly on problems that require multi-step reasoning. In summary, the paper addresses how to enable large language models to carry out complex reasoning and proposes a simple yet effective method, chain-of-thought prompting, that significantly improves their performance on such tasks.
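
To make the mechanism concrete, the following is a minimal sketch of how a few-shot chain-of-thought prompt might be assembled and how a final answer might be read back from the model's output. It is not the authors' code. The `Exemplar` class, the function names, the "Q:/A:" layout, the "The answer is" phrasing, and the sample exemplar are all assumptions made up for illustration, and no model is actually called.

```python
import re
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked demonstration (hypothetical structure, not from the paper)."""
    question: str
    rationale: str  # the intermediate reasoning steps, i.e. the chain of thought
    answer: str


def build_prompt(exemplars, question, with_rationale=True):
    """Assemble a few-shot prompt.

    with_rationale=True  -> chain-of-thought prompt (reasoning shown before the answer)
    with_rationale=False -> standard few-shot prompt (answer only), for comparison
    """
    blocks = []
    for ex in exemplars:
        if with_rationale:
            completion = f"{ex.rationale} The answer is {ex.answer}."
        else:
            completion = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {completion}")
    # The new question ends with an open "A:" for the model to continue.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion):
    """Pull the number after 'The answer is' from generated text, or None."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


# A made-up exemplar for illustration only; it is not taken from the paper.
EXEMPLARS = [
    Exemplar(
        question="A shelf holds 3 boxes with 4 pens each. 2 pens are removed. "
                 "How many pens remain?",
        rationale="3 boxes of 4 pens is 3 * 4 = 12 pens. "
                  "Removing 2 leaves 12 - 2 = 10.",
        answer="10",
    )
]
```

Calling `build_prompt(EXEMPLARS, new_question)` yields the text that would be sent to a language model, and `extract_answer` would be applied to whatever the model generates. Passing `with_rationale=False` produces the answer-only baseline, so the two prompt styles differ only in whether the reasoning steps appear in the exemplars.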