Abstract: Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks like LogiQA and ReClor and newly released datasets like AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks with benchmarks requiring logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4, and we compare the performance of the two models. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets like LogiQA and ReClor. However, performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite and name it LogiEval.

What problem does this paper attempt to address?

The paper evaluates how well ChatGPT and GPT-4 perform on logical reasoning tasks. Specifically, the authors test both models on multiple-choice reading comprehension and natural language inference tasks across several logical reasoning datasets, and analyze their strengths and limitations.

### Main Issues of the Paper:

1. **Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4**:
   - The authors aim to understand how these models perform on different logical reasoning tasks, contrasting well-known datasets with newly released and out-of-distribution ones.
2. **Comparing the Performance of ChatGPT and GPT-4**:
   - Through experiments, the authors compare the two models with each other on logical reasoning tasks and with a traditional fine-tuning baseline (RoBERTa).
3. **Exploring the Performance of Models on Different Tasks**:
   - The authors test both multiple-choice reading comprehension and natural language inference tasks to evaluate the logical reasoning abilities of these models comprehensively.

### Specific Goals:

- **Constructing Logical Reasoning Datasets**:
  - The authors construct an out-of-distribution logical reasoning dataset to probe robustness, and release the prompt-style logical reasoning datasets as a benchmark suite named LogiEval for testing prompt-based large language models.
- **Analyzing Model Performance on Different Datasets**:
  - The authors selected several popular logical reasoning datasets, such as LogiQA, ReClor, and ConTRoL, as well as the newly released dataset AR-LSAT, to evaluate the performance of ChatGPT and GPT-4.
- **Exploring the In-Context Learning Ability of Models**:
  - The authors studied how GPT-4 performs after it has seen examples multiple times within the same conversation window, to evaluate its in-context learning ability.
- **Exploring the Effect of Chain-of-Thought Prompts**:
  - The authors tried Chain-of-Thought (CoT) prompts to evaluate the impact of this prompting method on GPT-4's performance in logical reasoning tasks.

### Summary:

The paper comprehensively evaluates ChatGPT and GPT-4 on logical reasoning tasks using multiple datasets, contrasting their results on well-known datasets with those on newly released and out-of-distribution datasets. By constructing new datasets and exploring different prompting methods, the authors further investigate the performance and limitations of these models.
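To make the evaluation setup concrete, here is a minimal sketch of how a prompt-style multiple-choice item might be formatted, optionally with a Chain-of-Thought instruction, and scored for accuracy. It is not the authors' code: the `MultiChoiceItem` fields, the prompt wording, the `Answer: <letter>` reply convention, and the `model` callable are all assumptions made for illustration, and the actual LogiEval prompt format may differ.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

LETTERS = "ABCDEFGH"  # option labels; assumes at most eight options per item


@dataclass
class MultiChoiceItem:
    """Hypothetical container for one multiple-choice reading comprehension item."""
    passage: str
    question: str
    options: List[str]
    answer: int  # index of the gold option


def build_prompt(item: MultiChoiceItem, chain_of_thought: bool = False) -> str:
    """Render an item as a plain-text prompt (wording is an assumption)."""
    lines = [f"Passage: {item.passage}", f"Question: {item.question}"]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}. {option}")
    if chain_of_thought:
        lines.append("Let's think step by step, then finish with 'Answer: <letter>'.")
    else:
        lines.append("Reply with 'Answer: <letter>'.")
    return "\n".join(lines)


def parse_answer(reply: str, num_options: int) -> Optional[int]:
    """Return the index of the last 'Answer: X' in the reply, or None if absent or out of range."""
    matches = re.findall(r"Answer:\s*\(?([A-H])\b", reply)
    if not matches:
        return None
    index = LETTERS.index(matches[-1])
    return index if index < num_options else None


def accuracy(
    items: List[MultiChoiceItem],
    model: Callable[[str], str],
    chain_of_thought: bool = False,
) -> float:
    """Fraction of items answered correctly; unparseable replies count as wrong."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        reply = model(build_prompt(item, chain_of_thought))
        if parse_answer(reply, len(item.options)) == item.answer:
            correct += 1
    return correct / len(items)
```

Here `model` is any function that maps a prompt string to a reply string, so a real API client or a stub can be plugged in. Taking the last `Answer:` match lets a Chain-of-Thought reply reason freely before committing to a final option.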

Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

GLoRE: Evaluating Logical Reasoning of Large Language Models

GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts

GPT-4 Can't Reason

Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

LogiCoT: Logical Chain-of-Thought Instruction-Tuning

Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models

Complementary Advantages of ChatGPTs and Human Readers in Reasoning: Evidence from English Text Reading Comprehension

Uncovering ChatGPT's Capabilities in Recommender Systems

Evaluating the ChatGPT family of models for biomedical reasoning and classification

Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate

LogiGAN: Learning Logical Reasoning via Adversarial Pre-training

Assessing GPT4-V on Structured Reasoning Tasks

LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

ChatLogic: Integrating Logic Programming with Large Language Models for Multi-Step Reasoning

Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

GPTEval: A Survey on Assessments of ChatGPT and GPT-4