Abstract: Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To investigate this further, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT-4 solves only 1.4%. SearchBench problems require considering multiple pathways to the solution as well as backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that in-context learning with A* algorithm implementations enhances performance. The full potential of this prompting approach emerges when combined with our proposed Multi-Stage-Multi-Try method, which breaks down the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.

What problem does this paper attempt to address?

The paper addresses the difficulty that large language models (LLMs) have with logic problems and puzzles that are relatively easy for humans but hard for current state-of-the-art models. Specifically, it approaches the issue through the following aspects:

1. **Introduction of a New Benchmark**: A new benchmark, SearchBench, is proposed, containing 11 unique search problem types. Each type is equipped with an automated pipeline that generates any number of instances and analyzes the feasibility, correctness, and optimality of the solutions generated by LLMs.
2. **Evaluation of Current LLM Performance**: Even the most advanced LLMs perform poorly when asked to solve these problems end-to-end in text. For example, GPT-4 solves only 1.4% of them, and asking it to generate code that solves the problem raises this only to 11.7%.
3. **Improvement Method**: The paper uses in-context learning with A* algorithm implementations and combines it with the proposed Multi-Stage-Multi-Try (MSMT) method. MSMT decomposes the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance to over 57%.
4. **Exploration of LLMs' Nonlinear Reasoning Ability**: SearchBench problems require considering multiple pathways to the solution as well as backtracking, which tests how auto-regressive models handle nonlinear reasoning.
5. **Comprehensive Evaluation**: Unlike other existing benchmarks, SearchBench evaluates not only the correctness of solutions but also their feasibility and optimality, giving a fuller picture of LLM performance on such problems.

In summary, the paper's main goal is to study and improve the ability of LLMs to solve complex search problems by introducing a new benchmark and improved prompting strategies.
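To make the ideas of A* search and of separate feasibility, correctness, and optimality checks concrete, here is a minimal Python sketch. It is not taken from the paper: the toy grid, the names `a_star`, `grid_neighbors`, `manhattan`, and `evaluate`, and the exact definitions of the three checks are all assumptions chosen for illustration, and SearchBench's own problem types and evaluation pipelines may differ.

```python
import heapq

# Hypothetical toy instance: 'S' start, 'G' goal, '#' wall, '.' open cell.
GRID = [
    "S.#.",
    ".##.",
    "...G",
]

def grid_neighbors(cell):
    """Yield (next_cell, step_cost) for each adjacent open cell."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc), 1

def manhattan(a, b):
    """Admissible heuristic for unit-cost moves on a 4-connected grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for a cheapest path, or None if the goal is unreachable."""
    # Frontier entries are (f = g + h, g, state, path so far).
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found later
        for nxt, step_cost in neighbors(state):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(nxt, goal), new_g, nxt, path + [nxt]),
                )
    return None

def evaluate(path, start, goal):
    """Return (feasible, correct, optimal) under this sketch's assumed definitions."""
    # Feasible: every move goes to an adjacent open cell.
    feasible = bool(path) and all(
        (b, 1) in grid_neighbors(a) for a, b in zip(path, path[1:])
    )
    # Correct: feasible, and the path runs from the start to the goal.
    correct = feasible and path[0] == start and path[-1] == goal
    # Optimal: correct, and its cost matches the A* reference cost.
    reference = a_star(start, goal, grid_neighbors, manhattan)
    optimal = correct and reference is not None and len(path) - 1 == reference[0]
    return feasible, correct, optimal

START, GOAL = (0, 0), (2, 3)
cost, path = a_star(START, GOAL, grid_neighbors, manhattan)
print(cost, evaluate(path, START, GOAL))
```

On this grid the cell to the right of the start is a dead end, so the search has to abandon that branch and go down the left column instead, which is the kind of backtracking the summary describes. Separating the three checks lets a candidate solution be scored as rule-abiding, goal-reaching, or cheapest independently.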

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Are Your LLMs Capable of Stable Reasoning?

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Reliable Reasoning Beyond Natural Language

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Competition-Level Problems are Effective LLM Evaluators

Are Large-Language Models Graph Algorithmic Reasoners?

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

NLPBench: Evaluating Large Language Models on Solving NLP Problems

Benchmarking Large Language Models for Math Reasoning Tasks

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

Can LLMs Reason in the Wild with Programs?

Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying

A NotSo Simple Way to Beat Simple Bench

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games