An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Konstantin Hebenstreit, Robert Praas, Louis P Kiesewetter, Matthias Samwald
2023-08-03
Abstract: Emergent chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs). However, uncertainties remain about how reasoning strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl and Cohere command-xlarge) on a mixture of six question-answering datasets, including datasets from scientific and medical domains. Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets. GPT-4 has the most benefit from current state-of-the-art reasoning strategies and exhibits the best performance by applying a prompt previously discovered through automated discovery.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores and evaluates how well zero-shot chain-of-thought (CoT) prompting strategies generalize to recent large language models (LLMs). Specifically:

1. **Generality Evaluation**: The researchers tested different chain-of-thought strategies on several recent large language models using zero-shot prompting, to check whether the strategies maintain consistent performance across models and datasets.
2. **Model Selection**: The study ran experiments on six recent language models: davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl, and Cohere command-xlarge.
3. **Dataset Coverage**: The experiments covered six question-answering datasets, including datasets from the common-sense, scientific, and medical domains.
4. **Results Analysis**: Although effectiveness varied somewhat between models and datasets, the gains from chain-of-thought strategies remained robust overall. GPT-4 benefited most from these strategies and performed best when applying a prompt previously found through automated discovery.

In summary, the paper focuses on the generality and effectiveness of chain-of-thought prompting strategies in recent language models and supports its conclusions with empirical evidence.
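
To make the setup concrete, here is a minimal sketch of how a zero-shot comparison of reasoning strategies might be organized: each strategy is a trigger phrase appended to a multiple-choice question, and accuracy is tallied per (model, strategy) pair. It is not the authors' code. The function names, the prompt layout, and the record format are hypothetical, and the only trigger shown is the widely used "Let's think step by step." phrase, assumed here for illustration; the summary above does not list the paper's actual prompts.

```python
from collections import defaultdict

# Hypothetical strategy table: an empty trigger means direct answering,
# a non-empty trigger is a zero-shot CoT phrase appended to the question.
STRATEGIES = {
    "direct": "",
    "zero_shot_cot": "Let's think step by step.",  # assumed example trigger
}


def build_prompt(question, options, trigger):
    """Format a multiple-choice question and append the reasoning trigger."""
    lines = [question]
    lines += [f"{label}) {text}" for label, text in options.items()]
    # rstrip() removes the trailing space when the trigger is empty.
    lines.append(f"Answer: {trigger}".rstrip())
    return "\n".join(lines)


def accuracy_by_condition(records):
    """Aggregate (model, strategy, is_correct) records into accuracies."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for model, strategy, is_correct in records:
        totals[(model, strategy)][0] += int(is_correct)
        totals[(model, strategy)][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}
```

In use, one would call `build_prompt` for every question and strategy, send the result to each model through whatever client is available, score the returned answer, and pass the scored records to `accuracy_by_condition` to compare strategies per model.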