Abstract: We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs) by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark, which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop where the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the limitations of large language models (LLMs) in handling tasks that require precise logical reasoning. Although LLMs perform excellently in natural language understanding, their performance is often unsatisfactory in applications involving complex logical reasoning (such as tasks in the medical, legal, or financial fields). To overcome this challenge, the authors propose the **LLM-ARC** framework.

### Main Goals of LLM-ARC

1. **Enhance Logical Reasoning Ability**: Improve the logical reasoning ability of LLMs by incorporating an Automated Reasoning Critic (ARC).
2. **Generate and Verify Test Cases**: LLM-ARC not only generates logical program code but also generates test cases to verify the semantic correctness of the code.
3. **Iterative Self-Correction**: Through a self-correction loop, the system can continuously improve the code and test cases until all tests pass or the maximum number of iterations is reached.
4. **Achieve New Performance Benchmarks**: In the FOLIO benchmark test, LLM-ARC achieved an accuracy of 88.32%, surpassing the existing best result (78.9%).

### Specific Methods

- **Actor-Critic Method**:
  - **Actor**: Uses LLM to generate declarative logical program code and test cases.
  - **Critic**: Uses an automated reasoning engine (such as Answer Set Programming, ASP) to execute the code and test cases and provide feedback.
- **Self-Supervised Training**: Trains the Actor through end-to-end dialogue traces and Critic feedback to generate high-quality code and test cases.
- **Logical Layering**: Applies logical layering to natural language statements in FOLIO to guide the generation of test cases.
- **Error Correction**: Provides detailed pseudocode and strategies to help the Actor correct errors based on Critic feedback.
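The Actor-Critic loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `Program`, `derive`, `critic`, `refine`, and `stub_actor` are hypothetical, the Actor is a hard-coded stub rather than an LLM, and a toy forward-chaining evaluator over ground Horn rules stands in for a real ASP solver.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Program:
    """A toy logic program plus its tests (hypothetical structure)."""
    facts: Set[str]
    rules: List[Tuple[Set[str], str]]   # (body atoms, head atom)
    tests: List[Tuple[str, bool]]       # (atom, should it be derivable?)


def derive(program: Program) -> Set[str]:
    """Forward-chain over ground Horn rules until no new atom is added."""
    known = set(program.facts)
    changed = True
    while changed:
        changed = False
        for body, head in program.rules:
            if body <= known and head not in known:
                known.add(head)
                changed = True
    return known


def critic(program: Program) -> List[str]:
    """Run every test and return one feedback message per failure."""
    known = derive(program)
    failures = []
    for atom, expected in program.tests:
        if (atom in known) != expected:
            status = "derivable" if expected else "not derivable"
            failures.append(f"test failed: expected {atom} to be {status}")
    return failures


def refine(actor: Callable[[str, List[str]], Program],
           problem: str,
           max_iters: int = 3) -> Tuple[Program, List[str], int]:
    """Ask the Actor for a program, let the Critic test it, and repeat
    until all tests pass or max_iters is reached."""
    feedback: List[str] = []
    for iteration in range(1, max_iters + 1):
        program = actor(problem, feedback)
        feedback = critic(program)
        if not feedback:
            break
    return program, feedback, iteration


def stub_actor(problem: str, feedback: List[str]) -> Program:
    """Stand-in for the LLM Actor: the first draft omits a rule, and the
    draft produced after Critic feedback adds it."""
    rules: List[Tuple[Set[str], str]] = []
    if feedback:
        rules.append(({"bird(tweety)"}, "flies(tweety)"))
    return Program(facts={"bird(tweety)"},
                   rules=rules,
                   tests=[("flies(tweety)", True)])


program, feedback, iterations = refine(stub_actor, "Tweety is a bird. Birds fly.")
```

In this toy run the first draft fails its only test, the Critic's message is passed back to the Actor, and the second draft passes, so the loop stops after two iterations with empty feedback.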
### Experimental Results

- **Benchmark Tests**: Conducted various experiments on the FOLIO dataset, including zero-shot, few-shot, and fine-tuned LLM-only baseline systems, as well as different versions of the LLM-ARC system.
- **Performance Improvement**: Compared to the LLM-only baseline system, the LLM-ARC system improved significantly once test case generation was incorporated, especially in few-shot settings.
- **Best Results**: The LLM-ARC system trained with self-supervision achieved an accuracy of 88.32% on the FOLIO benchmark, roughly 9.4 percentage points higher than the previous best result (78.9%).

### Conclusion

The LLM-ARC framework significantly improves accuracy on complex logical reasoning tasks by combining the generative capabilities of LLMs with the verification capabilities of automated reasoning engines. The framework achieved a new best result on the FOLIO benchmark, demonstrating its strong potential in natural language reasoning tasks.

LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic
