Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

2023-01-11

Abstract: We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper asks how to improve the performance of large language models on complex reasoning tasks, and answers with a method called "chain-of-thought prompting." Instead of showing the model only question-answer pairs, the prompt includes a few exemplars whose answers spell out the intermediate reasoning steps, which leads the model to produce its own step-by-step reasoning before giving a final answer. No additional training or fine-tuning is required; the change is entirely in the prompt.

Experiments show that for the 540-billion-parameter PaLM model, prompting with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even fine-tuned GPT-3 with a verifier. The method is not limited to arithmetic: it also improves results on commonsense reasoning tasks (such as the CSQA dataset) and symbolic reasoning tasks (e.g., letter concatenation and coin flipping problems). The experiments further show that the benefit of chain-of-thought prompting grows with model size, particularly on problems that require multi-step reasoning.

In summary, the paper addresses how to get large language models to carry out complex reasoning more reliably, and proposes a simple yet effective method, chain-of-thought prompting, that significantly improves performance on such tasks.
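The prompting scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `build_cot_prompt` and `extract_answer` helpers, the "Q:/A:" layout, the "The answer is" marker, and the sample word problems are all assumptions chosen for the example, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked demonstration (hypothetical structure for this sketch)."""
    question: str
    chain_of_thought: str  # intermediate reasoning steps, written as prose
    answer: str


def build_cot_prompt(exemplars, question):
    """Concatenate worked exemplars, then append the new, unanswered question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion, marker="The answer is"):
    """Return the text after the last marker on its line, or None if absent."""
    _, found, tail = completion.rpartition(marker)
    if not found:
        return None
    stripped = tail.strip()
    if not stripped:
        return None
    return stripped.splitlines()[0].rstrip(".").strip()


# Illustrative exemplar: the reasoning is spelled out before the final answer.
exemplars = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        chain_of_thought=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    )
]

new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

prompt = build_cot_prompt(exemplars, new_question)
print(prompt)
```

The resulting `prompt` string would be sent to a language model as-is; a completion that follows the exemplar's pattern can then be reduced to a final answer with `extract_answer`. A standard few-shot baseline would use the same layout with `chain_of_thought` left out, which is the only difference between the two prompting styles in this sketch.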

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Why Can Large Language Models Generate Correct Chain-of-Thoughts?

Large Language Models as Analogical Reasoners

Chain-of-Thought Reasoning Without Prompting

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Automatic Chain of Thought Prompting in Large Language Models

Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

Reasoning with Large Language Models, a Survey

Pattern-Aware Chain-of-Thought Prompting in Large Language Models

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Complexity-Based Prompting for Multi-Step Reasoning

Chain-of-Thought Augmentation with Logit Contrast for Enhanced Reasoning in Language Models

Boosting Language Models Reasoning with Chain-of-Knowledge Prompting

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance