Lost in the Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games

Saumya Malik
2024-09-24
Abstract: In this thesis, I evaluate the performance of Large Language Models (LLMs) on the Law School Admission Test (LSAT), specifically the Logic Games section of the test. I focus on this section because it presents a complex logical reasoning task and thus is a valuable source of data for evaluating how modern, increasingly capable LLMs can handle hard logical reasoning tasks. I construct a dataset of LSAT logic games and their associated metadata, and extensively evaluate LLMs' performance in a Chain-of-Thought prompting setting. Given the weak performance in this setting, I explore other prompting frameworks on a smaller subset of the dataset, adapting ideas from Reflexion to this task. This results in a substantially improved accuracy of 70 percent for GPT-4 and 46 percent for GPT-3.5 on this data subset, highlighting the capacity of LLMs to revise their logical errors, despite initially weak performance. Finally, I analyze the types of logic games that models perform better or worse on, as well as the types of logical errors I observe from human annotation, providing detailed insights on the logical reasoning capabilities of LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper evaluates the performance of large language models (LLMs) on complex logical reasoning tasks, specifically the Logic Games section of the Law School Admission Test (LSAT). The author constructs a dataset of LSAT logic games and their associated metadata, and evaluates multiple closed-source and open-source LLMs under different prompting strategies to explore how they handle hard logical reasoning problems. Specifically, the paper addresses the following problems:

1. **Dataset construction**: Construct a dataset containing all publicly available LSAT logic games and their metadata, including information such as difficulty and game type. This dataset is publicly released to encourage further research.
2. **Limitations of Chain-of-Thought prompting**: Explore the limitations of traditional Chain-of-Thought prompting on LSAT logic games and evaluate its applicability on specific subsets of the data.
3. **Application of the Reflexion framework**: Adapt and implement ideas from the Reflexion framework for LSAT logic games. The results show that this method substantially improves model accuracy.
4. **Analysis of logical reasoning ability**: Quantitatively and qualitatively analyze the performance of LLMs on different types of logic games, revealing the types of logic that these models handle particularly well or poorly.

### Main contributions

1. **Dataset construction**: Construct and publicly release a dataset containing all publicly available LSAT logic games and their metadata, providing a resource for future research.
2. **Exploration of Chain-of-Thought prompting**: Thoroughly explore the limitations of traditional Chain-of-Thought prompting on LSAT logic games and its applicability on specific subsets of the data.
3. **Implementation of the Reflexion framework**: Apply the Reflexion framework to logic games, substantially improving model accuracy. On the smaller data subset used for this experiment, GPT-4 reaches 70% accuracy and GPT-3.5 reaches 46%.
4. **Analysis of logical reasoning ability**: Through quantitative and qualitative analysis of different types of logic games, together with human annotation of logical errors, reveal the strengths and weaknesses of LLMs on complex logical reasoning tasks.

### Conclusion

The paper finds that traditional multi-round Chain-of-Thought prompting yields relatively low overall accuracy on LSAT logic games. Even a highly capable model such as GPT-4 reaches only 33% accuracy. With the Reflexion framework, however, accuracy improves substantially, which indicates that LLMs show stronger capabilities when they have the opportunity to reflect on and correct their logical errors. This finding demonstrates the value of the Reflexion framework for evaluating LLM capabilities and suggests new directions for future evaluation methods.
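The reflect-and-retry idea summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the prompt wording, the `LLM` callable interface, the helper names (`build_prompt`, `extract_choice`, `solve_with_reflection`), and the use of a bare "that answer was incorrect" signal as feedback are all assumptions made for illustration. A toy stand-in replaces the real model so the loop can be run without any API.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a text reply.
LLM = Callable[[str], str]


def build_prompt(question: str, reflections: List[str]) -> str:
    """Prepend notes from earlier failed attempts to the question."""
    parts: List[str] = []
    if reflections:
        parts.append("Notes from earlier incorrect attempts:")
        parts.extend(f"- {note}" for note in reflections)
    parts.append(question)
    parts.append("Think step by step, then finish with 'Answer: <letter>'.")
    return "\n".join(parts)


def extract_choice(reply: str) -> str:
    """Return the letter after the last 'Answer:' marker, or '' if absent."""
    marker = "Answer:"
    idx = reply.rfind(marker)
    if idx == -1:
        return ""
    return reply[idx + len(marker):].strip()[:1].upper()


def solve_with_reflection(
    question: str, gold: str, llm: LLM, max_trials: int = 3
) -> Tuple[str, int]:
    """Retry a multiple-choice question, adding a self-critique after each miss.

    Returns the final choice and the number of trials used.
    """
    reflections: List[str] = []
    choice = ""
    for trial in range(1, max_trials + 1):
        reply = llm(build_prompt(question, reflections))
        choice = extract_choice(reply)
        if choice == gold:
            return choice, trial
        # Assumed feedback: the model learns only that it was wrong,
        # never which option is correct.
        critique = llm(
            f"{question}\nYour reasoning:\n{reply}\n"
            "That answer was incorrect. In one or two sentences, say what "
            "logical error you made and how to avoid it."
        )
        reflections.append(critique.strip())
    return choice, max_trials


def toy_llm(prompt: str) -> str:
    """Toy stand-in for a model: answers C only once it has seen a reflection."""
    if "That answer was incorrect" in prompt:
        return "I ignored the rule that X must come before Y."
    if "Notes from earlier incorrect attempts:" in prompt:
        return "X precedes Y, which rules out A. Answer: C"
    return "Guessing quickly. Answer: A"


if __name__ == "__main__":
    print(solve_with_reflection("Which order is possible?", "C", toy_llm))
```

With the toy model, the first attempt picks A, the critique is stored as a note, and the second attempt picks C. Swapping `toy_llm` for a real model call only requires a function with the same string-in, string-out shape.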