LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic

Aditya Kalyanpur, Kailash Karthik Saravanakumar, Victor Barres, Jennifer Chu-Carroll, David Melville, David Ferrucci
2024-07-19
Abstract: We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs), by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop where the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.
Computation and Language, Artificial Intelligence, Logic in Computer Science
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the limitations of large language models (LLMs) on tasks that require precise logical reasoning. Although LLMs perform well at natural language understanding, they are often unreliable in applications that involve complex logical reasoning, such as tasks in the medical, legal, or financial fields. To overcome this, the authors propose the **LLM-ARC** framework.

### Main Goals of LLM-ARC

1. **Enhance Logical Reasoning Ability**: Improve the logical reasoning of LLMs by pairing them with an Automated Reasoning Critic (ARC).
2. **Generate and Verify Test Cases**: LLM-ARC generates not only logic program code but also test cases that check the semantic correctness of that code.
3. **Iterative Self-Correction**: In a self-correction loop, the system revises the code and test cases until all tests pass or the maximum number of iterations is reached.
4. **Achieve New Performance Benchmarks**: On the FOLIO benchmark, LLM-ARC achieved an accuracy of 88.32%, surpassing the previous best result (78.9%).

### Specific Methods

- **Actor-Critic Method**:
  - **Actor**: Uses an LLM to generate declarative logic program code and test cases.
  - **Critic**: Uses an automated reasoning engine (implemented with Answer Set Programming, ASP) to execute the code and test cases and to report test failures as feedback.
- **Self-Supervised Training**: Trains the Actor on end-to-end dialogue traces that include Critic feedback, so that it generates higher-quality code and test cases.
- **Logical Layering**: Applies logical layering to the natural language statements in FOLIO to guide the generation of test cases.
- **Error Correction**: Provides detailed pseudocode and strategies to help the Actor correct errors based on Critic feedback.
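The Actor-Critic loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `refine`, `toy_actor`, and `toy_critic` are hypothetical, the Actor is a hard-coded stub standing in for an LLM call, and the Critic is a plain string check standing in for an ASP solver that would actually run the tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    program: str      # declarative logic program text (ASP-style, for illustration)
    tests: List[str]  # descriptions of the tests the Critic should check


# Hypothetical interfaces: the Actor maps (problem, feedback) to a candidate,
# and the Critic maps a candidate to a list of failure messages (empty = all pass).
Actor = Callable[[str, List[str]], Candidate]
Critic = Callable[[Candidate], List[str]]


def refine(problem: str, actor: Actor, critic: Critic,
           max_iters: int = 3) -> Tuple[Candidate, int, bool]:
    """Iterate until all tests pass or max_iters Critic evaluations are used."""
    feedback: List[str] = []
    candidate = actor(problem, feedback)
    for attempt in range(1, max_iters + 1):
        feedback = critic(candidate)
        if not feedback:
            return candidate, attempt, True
        if attempt < max_iters:
            # Hand the failure messages back to the Actor for a revised program.
            candidate = actor(problem, feedback)
    return candidate, max_iters, False


def toy_actor(problem: str, feedback: List[str]) -> Candidate:
    # Stub for an LLM: the first draft omits a rule, the revision adds it.
    rules = ["human(socrates)."]
    if feedback:
        rules.append("mortal(X) :- human(X).")
    return Candidate(program="\n".join(rules),
                     tests=["entails mortal(socrates)"])


def toy_critic(candidate: Candidate) -> List[str]:
    # Stub for a solver: a real Critic would execute the program and its tests.
    if "mortal(X) :- human(X)." in candidate.program:
        return []
    return ["test failed: entails mortal(socrates)"]


result, attempts, passed = refine(
    "All humans are mortal. Socrates is human.", toy_actor, toy_critic
)
```

With these stubs, the first draft fails its single test, the Critic's message triggers one revision, and the second draft passes. Setting `max_iters=1` shows the other exit condition, where the loop stops on the iteration cap and returns the last failing candidate.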
### Experimental Results

- **Benchmark Tests**: Experiments on the FOLIO dataset covered zero-shot, few-shot, and fine-tuned LLM-only baseline systems, as well as several versions of the LLM-ARC system.
- **Performance Improvement**: Adding test case generation gave the LLM-ARC system a significant improvement over the LLM-only baselines, especially in few-shot settings.
- **Best Results**: The LLM-ARC system trained with self-supervision achieved an accuracy of 88.32% on the FOLIO benchmark, about 9.4 percentage points higher than the previous best result (78.9%).

### Conclusion

The LLM-ARC framework significantly improves accuracy on complex logical reasoning tasks by combining the generative capabilities of LLMs with the verification capabilities of an automated reasoning engine. It set a new best result on the FOLIO benchmark, demonstrating strong potential for natural language reasoning tasks.