Abstract: In this thesis, I evaluate the performance of Large Language Models (LLMs) on the Law School Admissions Test (LSAT), specifically the Logic Games section of the test. I focus on this section because it presents a complex logical reasoning task and thus is a valuable source of data for evaluating how modern, increasingly capable LLMs can handle hard logical reasoning tasks. I construct a dataset of LSAT logic games and their associated metadata, and extensively evaluate LLMs' performance in a Chain-of-Thought prompting setting. Given the weak performance in this setting, I explore other prompting frameworks on a smaller subset of the dataset, adapting ideas from Reflexion to this task. This results in a substantially improved accuracy of 70 percent for GPT-4 and 46 percent for GPT-3.5 on this data subset, highlighting the capacity of LLMs to revise their logical errors, despite initially weak performance. Finally, I analyze the types of logic games that models perform better or worse on, as well as the types of logical errors I observe from human annotation, providing detailed insights on the logical reasoning capabilities of LLMs.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper evaluates the performance of large language models (LLMs) on hard logical reasoning tasks, specifically the Logic Games section of the Law School Admission Test (LSAT). The author constructs a dataset of LSAT logic games and their associated metadata, and evaluates multiple closed-source and open-source LLMs under different prompting strategies to explore how they perform on complex logical reasoning problems. Specifically, the paper addresses the following problems:

1. **Dataset construction**: Construct a dataset containing all publicly available LSAT logic games and their metadata, including information such as difficulty and game type. The dataset is publicly released to encourage further research.
2. **Limitations of traditional chain-of-thought prompting**: Explore the limitations of traditional Chain-of-Thought prompting on LSAT logic games and evaluate its applicability on specific subsets of the data.
3. **Application of the Reflexion framework**: Adapt and implement the Reflexion framework for LSAT logic games. The results show that this method significantly improves model accuracy.
4. **Analysis of logical reasoning ability**: Quantitatively and qualitatively analyze how LLMs perform on different types of logic games, revealing the types of logic these models handle particularly well or badly.

### Main contributions

1. **Dataset construction**: Construct and publicly release a dataset containing all publicly available LSAT logic games and their metadata, providing a resource for future research.
2. **Exploration of chain-of-thought prompting**: Thoroughly explore the limitations of traditional Chain-of-Thought prompting on LSAT logic games and its applicability on specific subsets of the data.
3. **Implementation of the Reflexion framework**: Apply the Reflexion framework, significantly improving model accuracy on logic games. On the evaluated subset, GPT-4 reaches 70% accuracy and GPT-3.5 reaches 46%.
4. **Analysis of logical reasoning ability**: Through quantitative and qualitative analysis of different types of logic games, reveal the strengths and weaknesses of LLMs in handling complex logical reasoning tasks.

### Conclusion

The paper finds that traditional multi-round Chain-of-Thought prompting yields relatively low overall accuracy on LSAT logic games. Even the most advanced model evaluated, GPT-4, reaches only 33% accuracy. With the Reflexion framework, however, accuracy improves significantly, which indicates that LLMs show stronger capabilities when they have the opportunity to reflect on and correct their logical errors. This finding demonstrates the value of the Reflexion framework for evaluating LLM capabilities and suggests new directions for future evaluation methods.
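The reflect-and-retry idea described in the conclusion can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`reflexion_loop`, `toy_solve`, `toy_reflect`), the callable signatures, the trial limit, and the assumption that the loop is told only whether an answer was right or wrong are all hypothetical, and the "model" here is a toy stand-in rather than a real LLM call.

```python
from typing import Callable, List, Optional

# Hypothetical signature: (question, reflections so far) -> answer letter.
Solver = Callable[[str, List[str]], str]
# Hypothetical signature: (question, wrong answer) -> short self-critique.
Reflector = Callable[[str, str], str]


def reflexion_loop(question: str, correct: str, solve: Solver,
                   reflect: Reflector, max_trials: int = 3) -> Optional[int]:
    """Return the 1-based trial on which the answer was correct, else None."""
    reflections: List[str] = []
    for trial in range(1, max_trials + 1):
        answer = solve(question, reflections)
        if answer == correct:
            return trial
        # Assumption: the only feedback is that the answer was wrong.
        # The model writes its own critique, carried into the next attempt.
        reflections.append(reflect(question, answer))
    return None


# Toy stand-ins for the model calls (no real LLM involved).
def toy_solve(question: str, reflections: List[str]) -> str:
    # Answers "A" on the first try, "B" once any reflection is available.
    return "B" if reflections else "A"


def toy_reflect(question: str, wrong_answer: str) -> str:
    return f"Answer {wrong_answer} violated a rule; re-check the constraints."


result = reflexion_loop("Which order is possible?", "B", toy_solve, toy_reflect)
print(result)  # prints 2: the toy solver succeeds on its second attempt
```

In a real evaluation, `solve` and `reflect` would wrap calls to the model under test, and accuracy would be the fraction of questions for which the loop returns a trial number rather than `None`.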

Lost in the Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

LLMs for Relational Reasoning: How Far are We?

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Enhancing Logical Reasoning in Large Language Models to Facilitate Legal Applications

Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning

Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

Evaluating the Deductive Competence of Large Language Models

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

How susceptible are LLMs to Logical Fallacies?

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models

Can LLMs Reason in the Wild with Programs?

Reliable Reasoning Beyond Natural Language

Conditional and Modal Reasoning in Large Language Models

Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

GLoRE: Evaluating Logical Reasoning of Large Language Models