Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, Yue Zhang
2023-05-05
Abstract: Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks like LogiQA and ReClor and newly released datasets like AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks with benchmarks requiring logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4. We also compare the performance of ChatGPT and GPT-4. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets like LogiQA and ReClor. However, performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite and name it LogiEval.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how well ChatGPT and GPT-4 perform on logical reasoning tasks. Specifically, the authors test both models on multiple-choice reading comprehension and natural language inference tasks across several logical reasoning datasets, and analyze their strengths and limitations.

### Main Issues of the Paper:

1. **Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4**:
   - The authors aim to understand how these models perform on different logical reasoning tasks, contrasting well-known datasets with newly released and out-of-distribution datasets.
2. **Comparing the Performance of ChatGPT and GPT-4**:
   - Through experiments, the authors compare the two models with each other and with a traditional fine-tuning method (RoBERTa).
3. **Exploring the Performance of Models on Different Tasks**:
   - The authors test both multiple-choice reading comprehension tasks and natural language inference tasks to evaluate the models' logical reasoning abilities comprehensively.

### Specific Goals:

- **Constructing Logical Reasoning Datasets**:
  - The authors construct a logical reasoning out-of-distribution dataset and release the prompt-style logical reasoning datasets as a benchmark suite, LogiEval, for testing prompt-based large language models.
- **Analyzing Model Performance on Different Datasets**:
  - The authors select several popular logical reasoning datasets, such as LogiQA, ReClor, and ConTRoL, as well as the newly released dataset AR-LSAT, to evaluate the performance of ChatGPT and GPT-4.
- **Exploring the In-Context Learning Ability of Models**:
  - The authors study how GPT-4 performs after seeing examples multiple times within the same conversation window to evaluate its in-context learning ability.
- **Exploring the Effect of Chain-of-Thought Prompts**:
  - The authors try Chain-of-Thought (CoT) prompts to evaluate the impact of this prompting method on GPT-4's performance in logical reasoning tasks.

### Summary:

The paper comprehensively evaluates ChatGPT and GPT-4 on logical reasoning tasks using multiple datasets, with particular attention to the gap between well-known datasets and newly released, out-of-distribution datasets. By constructing new datasets and exploring different prompting methods, the authors further investigate the performance and limitations of these models.
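
A prompt-style multiple-choice evaluation of the kind described above might be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `MultiChoiceItem` fields, the prompt wording, the answer-parsing rule, and the `model` callable are all assumptions made for this example, and the sample items are invented rather than drawn from LogiEval.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

LETTERS = "ABCD"


@dataclass
class MultiChoiceItem:
    """Hypothetical record for one multiple-choice reasoning question."""
    passage: str
    question: str
    options: List[str]
    answer: int  # index of the gold option


def build_prompt(item: MultiChoiceItem, chain_of_thought: bool = False) -> str:
    """Render an item as a plain-text prompt (wording is an assumption)."""
    lines = ["Passage: " + item.passage, "Question: " + item.question]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}. {option}")
    if chain_of_thought:
        lines.append("Let's think step by step, then give the final answer as a single letter.")
    else:
        lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_answer(reply: str) -> int:
    """Crude parser: take the last standalone letter A-D in the reply, or -1."""
    matches = re.findall(r"\b([A-D])\b", reply)
    return LETTERS.index(matches[-1]) if matches else -1


def accuracy(items: List[MultiChoiceItem], model: Callable[[str], str]) -> float:
    """Fraction of items where the parsed reply matches the gold option."""
    correct = sum(parse_answer(model(build_prompt(it))) == it.answer for it in items)
    return correct / len(items)


# Invented toy items and a stand-in model that always replies "A".
items = [
    MultiChoiceItem(
        "All birds in the park can fly. Tweety is a bird in the park.",
        "Which statement must be true?",
        ["Tweety can fly", "Tweety can swim", "Tweety is a fish", "None of these"],
        0,
    ),
    MultiChoiceItem(
        "No reptiles have fur. A gecko is a reptile.",
        "Which statement must be true?",
        ["A gecko has fur", "A gecko has no fur", "A gecko is a bird", "None of these"],
        1,
    ),
]


def always_a(prompt: str) -> str:
    return "A"


print(accuracy(items, always_a))
```

In practice `model` would wrap a call to whichever chat API is under test; passing `chain_of_thought=True` to `build_prompt` switches to a CoT-style instruction, in which case taking the last standalone letter in the reply is only a rough heuristic for locating the final answer.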