Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
2024-06-18
Abstract: Recently, Large Language Models (LLMs) have attained impressive performance in math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT-4 solves only 1.4%. SearchBench problems require considering multiple pathways to the solution as well as backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that in-context learning with A* algorithm implementations enhances performance. The full potential of this prompting approach emerges when combined with our proposed Multi-Stage-Multi-Try method, which breaks down the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulty that large language models (LLMs) have with logic problems and puzzles that are relatively easy for humans but remain hard for state-of-the-art models. It does so through the following contributions:

1. **Introduction of a New Benchmark**: The paper proposes SearchBench, a benchmark containing 11 unique search problem types. Each type comes with an automated pipeline that generates an arbitrary number of instances and analyzes the feasibility, correctness, and optimality of LLM-generated solutions.
2. **Evaluation of Current LLM Performance**: Even the most advanced LLMs perform poorly when asked to solve these problems end-to-end in text. For example, GPT-4 solves only 1.4% of them, and instructing it to generate code that solves the problem raises this only to 11.7%.
3. **Improvement Method**: The paper shows that in-context learning with A* algorithm implementations enhances performance, and combines it with the proposed Multi-Stage-Multi-Try (MSMT) method. MSMT breaks the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.
4. **Exploration of LLMs' Nonlinear Reasoning Ability**: SearchBench problems require considering multiple pathways to the solution as well as backtracking, which poses a significant challenge to auto-regressive models and tests their nonlinear reasoning.
5. **Comprehensive Evaluation**: Rather than checking correctness alone, SearchBench also evaluates the feasibility and optimality of each solution, giving a fuller picture of how LLMs perform on such problems.

In summary, the paper's main goal is to study and improve the ability of LLMs to solve complex search problems by introducing a new benchmark and improved prompting strategies.
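
To make the summary more concrete, here is a minimal Python sketch of an A* search on a toy grid, together with a checker that reports feasibility, correctness, and optimality separately. Everything in it is an assumption made for illustration: the grid problem, the function names `a_star` and `evaluate`, and the reading of the three criteria (feasible = every move obeys the rules, correct = feasible and reaches the goal, optimal = correct with minimal cost). It is not the paper's pipeline, its A* prompt, or the MSMT method.

```python
import heapq


def a_star(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid is a list of lists where 0 is a free cell and 1 is a wall.
    Moves are 4-connected and cost 1 each (an assumption of this toy problem).
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Heap entries are (f = g + h, g, cell, path so far).
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if g > best_g.get(cell, float("inf")):
            continue  # stale entry: a cheaper route to this cell was found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(
                        frontier,
                        (ng + h((nr, nc)), ng, (nr, nc), path + [(nr, nc)]),
                    )
    return None  # goal unreachable


def evaluate(grid, start, goal, path):
    """Hypothetical checker returning the three flags for a proposed path."""
    rows, cols = len(grid), len(grid[0])
    # Feasible: non-empty, stays on free in-bounds cells, moves one step at a time.
    on_free_cells = all(
        0 <= r < rows and 0 <= c < cols and grid[r][c] == 0 for r, c in path
    )
    unit_steps = all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )
    feasible = bool(path) and on_free_cells and unit_steps
    # Correct: feasible and actually connects start to goal.
    correct = feasible and path[0] == start and path[-1] == goal
    # Optimal: correct and as short as the A* reference solution.
    reference = a_star(grid, start, goal)
    optimal = correct and reference is not None and len(path) == len(reference)
    return {"feasible": feasible, "correct": correct, "optimal": optimal}


GRID = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]

if __name__ == "__main__":
    solution = a_star(GRID, (0, 0), (2, 0))
    print(solution)
    print(evaluate(GRID, (0, 0), (2, 0), solution))
```

Reporting the three flags separately lets a checker distinguish a path that breaks the rules from one that is legal but stops short of the goal, or one that reaches the goal by a longer route than necessary.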