Abstract: While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: https://github.com/chenqi008/LateralThinking

What problem does this paper attempt to address?

This paper addresses the challenges of evaluating and eliciting the lateral thinking ability of large language models (LLMs). Specifically:

1. **Limitations of existing benchmarks**:
   - Although current natural language processing (NLP) techniques have significantly improved the performance of LLMs on vertical thinking tasks (such as complex reasoning and common-sense reasoning), the lateral thinking capabilities of these models remain under-explored.
   - Existing benchmarks focus mainly on vertical thinking and largely ignore lateral thinking, so creative problem-solving ability is insufficiently evaluated.
2. **Complexity of lateral thinking evaluation**:
   - Lateral thinking involves creative, multi-angle reasoning. Evaluating it is difficult because the creative thought process itself must be assessed and relevant data is scarce.
   - Traditional model-based evaluation usually relies on an evaluation model that is stronger than the model being evaluated, which limits the evaluation of state-of-the-art LLMs.
3. **Proposed new framework and benchmark**:
   - The paper introduces SPLAT, a benchmark that uses situation puzzles to evaluate and elicit the lateral thinking ability of LLMs.
   - SPLAT contains 975 graded situation puzzles across three difficulty levels and adopts a multi-turn player-judge framework. The framework simulates an interactive game in which the model (player) asks the evaluation model (judge) questions about an incomplete story in order to infer the full scenario.
4. **Reducing dependence on strong evaluation models**:
   - Because the judge answers from a detailed reference scenario, the framework depends less on a more powerful evaluation model, making it possible to evaluate the latest LLMs without requiring the judge to be stronger than the player.
5. **Experimental verification**:
   - The experiments show that a robust evaluation model (such as WizardLM-2) closely matches human judgement in both intermediate question answering and final scenario accuracy, reaching over 80% agreement, similar to the agreement level among humans.
   - Applying the data and reasoning processes of SPLAT to other lateral thinking benchmarks (such as RiddleSense and BrainTeaser) improves performance, indicating that SPLAT can both evaluate and elicit the lateral thinking ability of LLMs.

### Summary

The core problem of the paper is how to effectively evaluate and elicit the lateral thinking ability of large language models, given the limitations of traditional benchmarks and the difficulty of evaluation. To solve it, the paper proposes the SPLAT benchmark and a multi-turn player-judge framework.
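The multi-turn player-judge game described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name (`play_puzzle`, `GameResult`, the `toy_*` stand-ins), the yes/no answer vocabulary, the round limit, and the choice to let the player guess after every question are assumptions made for illustration. In a real setup the four callables would wrap LLM calls. Here they are keyword-matching stubs so the loop can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, judge answer) pairs


@dataclass
class GameResult:
    solved: bool
    rounds: int
    history: History = field(default_factory=list)


def play_puzzle(
    story: str,
    reference: str,
    player_ask: Callable[[str, History], str],
    player_guess: Callable[[str, History], str],
    judge_answer: Callable[[str, str], str],
    judge_match: Callable[[str, str], bool],
    max_rounds: int = 5,
) -> GameResult:
    """Run one game: the player sees only `story`, the judge sees `reference`."""
    history: History = []
    for round_idx in range(1, max_rounds + 1):
        # Player asks a question about the incomplete story.
        question = player_ask(story, history)
        # Judge answers using the detailed reference scenario.
        history.append((question, judge_answer(reference, question)))
        # Player proposes a full scenario; judge checks it against the reference.
        guess = player_guess(story, history)
        if judge_match(reference, guess):
            return GameResult(True, round_idx, history)
    return GameResult(False, max_rounds, history)


# --- Toy stand-ins (hypothetical; real players and judges would be LLMs) ---
STORY = "A man asks a bartender for water. The bartender pulls out a gun. The man says thanks and leaves."
REFERENCE = "The man had hiccups, and the bartender scared him with the gun to cure them."
SCRIPTED_QUESTIONS = ["Was the man thirsty?", "Did the man have hiccups?"]


def toy_ask(story: str, history: History) -> str:
    # Ask the scripted questions in order, cycling if rounds exceed the script.
    return SCRIPTED_QUESTIONS[len(history) % len(SCRIPTED_QUESTIONS)]


def toy_judge_answer(reference: str, question: str) -> str:
    # Crude keyword check standing in for a judge model.
    return "yes" if "hiccups" in question.lower() and "hiccups" in reference.lower() else "no"


def toy_guess(story: str, history: History) -> str:
    # Guess the hiccups scenario only once the judge has confirmed it.
    if any("hiccups" in q.lower() and a == "yes" for q, a in history):
        return "The man had hiccups and the scare cured them."
    return "The man was being robbed."


def toy_judge_match(reference: str, guess: str) -> bool:
    return "hiccups" in guess.lower() and "hiccups" in reference.lower()


result = play_puzzle(STORY, REFERENCE, toy_ask, toy_guess, toy_judge_answer, toy_judge_match)
```

Keeping the reference scenario visible only to the judge is what lets a weaker judge grade a stronger player in this sketch: the judge compares text against a known answer and never has to solve the puzzle itself.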

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Unleashing the Creative Mind: Language Model As Hierarchical Policy For Improved Exploration on Challenging Problem Solving

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark

Competition-Level Problems are Effective LLM Evaluators

COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes

Eliminating Reasoning via Inferring with Planning: A New Framework to Guide LLMs' Non-linear Thinking

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

SmartPlay: A Benchmark for LLMs as Intelligent Agents

Missed Connections: Lateral Thinking Puzzles for Large Language Models