An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Konstantin Hebenstreit, Robert Praas, Louis P Kiesewetter, Matthias Samwald

2023-08-03

Abstract: Emergent chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs). However, uncertainties remain about how reasoning strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl and Cohere command-xlarge) on a mixture of six question-answering datasets, including datasets from scientific and medical domains. Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets. GPT-4 has the most benefit from current state-of-the-art reasoning strategies and exhibits the best performance by applying a prompt previously discovered through automated discovery.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper evaluates how well zero-shot chain-of-thought (CoT) prompting strategies generalize to recently released large language models (LLMs) and to different datasets. Specifically:

1. **Generality Evaluation**: The researchers tested different chain-of-thought strategies, induced by zero-shot prompting, on several recent LLMs to check whether the gains from these strategies hold up across models and datasets.
2. **Model Selection**: The study used six models: davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl, and Cohere command-xlarge.
3. **Dataset Coverage**: The experiments covered six question-answering datasets, including datasets from the scientific and medical domains.
4. **Results Analysis**: Effectiveness varies somewhat between models and datasets, but the gains from chain-of-thought strategies remain robust overall. GPT-4 benefits the most from current reasoning strategies and performs best with a prompt that was previously found through automated discovery.

In summary, the paper is a small-scale empirical study of the generality and effectiveness of chain-of-thought prompting strategies on recent language models.
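The comparison described above (several reasoning strategies, several models, one pool of multiple-choice questions) can be pictured as a small evaluation grid. The following is only a minimal sketch under stated assumptions, not the authors' code: `STRATEGIES`, `build_prompt`, `extract_letter`, and `evaluate` are hypothetical names, each model is assumed to be a plain Python callable from prompt text to output text, and the trigger phrase "Let's think step by step." is the widely used zero-shot CoT trigger, not necessarily one of the prompts evaluated in the paper.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical strategy table: a direct-answer baseline and one zero-shot
# CoT trigger. The paper's actual prompt set is not listed in this summary.
STRATEGIES: Dict[str, str] = {
    "direct": "",
    "zero_shot_cot": "Let's think step by step.",
}

# One item = (question, answer choices, gold answer letter).
Item = Tuple[str, List[str], str]


def build_prompt(question: str, choices: List[str], trigger: str) -> str:
    """Format a multiple-choice question and append the reasoning trigger."""
    letters = "ABCDEFGHIJ"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {trigger}".rstrip())
    return "\n".join(lines)


def extract_letter(output: str) -> str:
    """Naive answer extraction (an assumption): the last standalone letter A-J."""
    matches = re.findall(r"\b([A-J])\b", output)
    return matches[-1] if matches else ""


def evaluate(
    models: Dict[str, Callable[[str], str]],
    items: List[Item],
) -> Dict[Tuple[str, str], float]:
    """Return accuracy for every (model, strategy) pair on the same items."""
    scores: Dict[Tuple[str, str], float] = {}
    for model_name, generate in models.items():
        for strategy, trigger in STRATEGIES.items():
            correct = 0
            for question, choices, gold in items:
                output = generate(build_prompt(question, choices, trigger))
                if extract_letter(output) == gold:
                    correct += 1
            scores[(model_name, strategy)] = correct / len(items)
    return scores
```

In such a setup, each entry of `models` would wrap a call to one of the six LLMs, and comparing the `direct` and CoT columns per model gives the kind of per-model gain the summary refers to. The answer extraction here is deliberately crude; a real evaluation would need a more careful parser.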
