LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
2024-10-12
Abstract: Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of evaluating the rule-based reasoning capabilities of large language models (LLMs). Although LLMs have demonstrated complex problem-solving abilities across various tasks, their evaluation as rule-based executors and planners remains underexplored. To fill this gap, the authors introduce a new benchmark, **LogicGame**, which assesses how well LLMs understand rules, execute them, and plan with them.

### Specific Issues Include:

1. **Rule Understanding and Execution**: Evaluating whether the model can understand and execute a given set of rules.
2. **Multi-step Planning**: Assessing whether the model can reason and plan logically over multiple steps.
3. **Pure Rule-based Reasoning**: Separating logical reasoning from mere knowledge recall by designing game scenarios that rely solely on predefined rules.
4. **Intermediate Step Evaluation**: Evaluating not only the final result but also the intermediate steps of the solution, to check that the model is reasoning from the rules rather than guessing the answer.

### Solution:

- **LogicGame**: A benchmark of varied game scenarios, each with its own set of rules, in which the model starts from an initial state and applies the rules step by step to solve the problem.
- **Difficulty Grading**: Game scenarios are divided into difficulty levels, from simple rule application to complex reasoning chains, so that rule understanding and multi-step execution can be assessed precisely.
- **Automatic Verification**: Intermediate steps are deterministic and can be verified automatically, which keeps the evaluation objective.
- **Bilingual Version**: The benchmark is provided in both Chinese and English to ensure fairness and applicability.
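To make the idea of deterministic, automatically verifiable intermediate steps concrete, here is a minimal sketch of an execution-style task. The game itself (a string with `swap` and `rotate` rules) and the names `apply_rule`, `expected_states`, and `verify_execution` are hypothetical illustrations, not taken from the benchmark.

```python
# Hypothetical toy game (an assumption for illustration, not a LogicGame task):
# the state is a string, and two rules may be applied to it.

def apply_rule(state: str, op: tuple) -> str:
    """Apply one rule deterministically and return the next state."""
    name = op[0]
    if name == "swap":      # ("swap", i, j): exchange the characters at i and j
        _, i, j = op
        chars = list(state)
        chars[i], chars[j] = chars[j], chars[i]
        return "".join(chars)
    if name == "rotate":    # ("rotate",): move the first character to the end
        return state[1:] + state[:1]
    raise ValueError(f"unknown rule: {name}")


def expected_states(initial: str, ops: list) -> list:
    """Replay the operations and record the state after each one."""
    states = []
    state = initial
    for op in ops:
        state = apply_rule(state, op)
        states.append(state)
    return states


def verify_execution(initial: str, ops: list, claimed_states: list) -> dict:
    """Compare a model's claimed intermediate states with the replayed ones."""
    expected = expected_states(initial, ops)
    final_expected = expected[-1] if expected else initial
    final_claimed = claimed_states[-1] if claimed_states else initial
    return {
        "answer_correct": final_claimed == final_expected,
        "process_correct": claimed_states == expected,
    }


if __name__ == "__main__":
    ops = [("swap", 0, 3), ("rotate",)]
    # A response with the right final state but a wrong intermediate step.
    print(verify_execution("abcd", ops, ["abcd", "bcad"]))
```

Because every rule is a pure function of the current state, a checker can replay the steps and flag a response that reaches the right final state through a wrong intermediate one, which is the case the sample call exercises.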
### Experimental Results:

- **Performance Evaluation**: Experiments on various LLMs show that even the best-performing models struggle with complex reasoning tasks, reaching only about 20% overall accuracy and less than 10% on tasks at the highest difficulty level.
- **Few-shot Learning**: In execution tasks, few-shot examples can improve model performance, whereas in planning tasks they may impair it.

### Main Contributions:

1. **Introduction of LogicGame**: A new benchmark for evaluating LLMs' rule-based reasoning capabilities, covering execution and planning tasks at different difficulty levels.
2. **Automated Evaluation Process**: The evaluation analyzes the solution process as well as the final answer, giving a fuller picture of LLMs' reasoning abilities.
3. **Extensive Experiments**: Experiments on various LLMs reveal their shortcomings in rule-based reasoning, with the best model achieving an overall accuracy of about 20%.

Through these contributions, the paper provides tools and methods for evaluating and improving LLMs' rule-based reasoning capabilities.
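As a rough illustration of how accuracy could be reported per difficulty level when both the answer and the process are checked, the sketch below counts an item as solved only if both are correct. This strict rule, the record fields, and the sample data are assumptions made for illustration; the summary above does not give the paper's exact metric definitions.

```python
from collections import defaultdict


def accuracy_by_level(results: list) -> dict:
    """Return the fraction of solved items per difficulty level.

    Each record is a dict with the hypothetical keys 'level',
    'answer_correct', and 'process_correct'. An item counts as solved
    only when both flags are true (an assumed, strict scoring rule).
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for record in results:
        totals[record["level"]] += 1
        if record["answer_correct"] and record["process_correct"]:
            solved[record["level"]] += 1
    return {level: solved[level] / totals[level] for level in sorted(totals)}


if __name__ == "__main__":
    # Made-up records, not results from the paper.
    sample = [
        {"level": 1, "answer_correct": True, "process_correct": True},
        {"level": 2, "answer_correct": True, "process_correct": True},
        {"level": 2, "answer_correct": True, "process_correct": False},
        {"level": 3, "answer_correct": False, "process_correct": False},
    ]
    print(accuracy_by_level(sample))
```

Under this rule, a correct answer reached through a wrong intermediate step does not count, so the reported accuracy reflects rule-following rather than lucky guesses.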