Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Qi Chen, Bowen Zhang, Gang Wang, Qi Wu
2024-10-09
Abstract: While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at https://github.com/chenqi008/LateralThinking
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the difficulty of evaluating and eliciting the lateral thinking ability of large language models (LLMs). Specifically:

1. **Limitations of existing benchmarks**:
   - Although advances in natural language processing (NLP) have significantly improved LLM performance on vertical thinking tasks (such as complex reasoning and commonsense reasoning), the lateral thinking capabilities of these models remain under-explored.
   - Existing benchmarks focus mainly on vertical thinking and neglect lateral thinking, so creative problem-solving ability is insufficiently evaluated.
2. **Complexity of lateral thinking evaluation**:
   - Lateral thinking involves creative, multi-angle reasoning. It is hard to evaluate because the creative thought process itself must be measured and relevant data is scarce.
   - Traditional model-based evaluation usually relies on an evaluation model stronger than the one being tested, which limits how well the most advanced LLMs can be evaluated.
3. **Proposed new framework and benchmark**:
   - The paper introduces SPLAT, a benchmark that uses Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs.
   - SPLAT contains 975 situation puzzles graded across three difficulty levels. It adopts a multi-turn player-judge framework that simulates an interactive game: the model under test (the player) asks the evaluation model (the judge) questions about an incomplete story in order to infer the full scenario, and the judge answers from a detailed reference scenario or checks whether the player's prediction matches it.
4. **Reducing dependence on strong evaluation models**:
   - Because the judge works from a reference scenario, the framework depends less on a more powerful evaluation model, so state-of-the-art LLMs can be assessed without requiring them to be weaker than the judge.
5. **Experimental verification**:
   - The experiments show that a robust evaluation model such as WizardLM-2 closely matches human judgement on both intermediate question answering and final scenario accuracy, reaching over 80% agreement, similar to the agreement level among humans.
   - Applying SPLAT's data and reasoning processes to other lateral thinking benchmarks (such as RiddleSense and BrainTeaser) improves performance, indicating that SPLAT can both evaluate and elicit the lateral thinking ability of LLMs.

### Summary

The core problem of the paper is how to effectively evaluate and elicit the lateral thinking ability of large language models, given that traditional benchmarks neglect it and that creative reasoning is hard to assess. To solve this, the paper proposes the SPLAT benchmark and a multi-turn player-judge framework.
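The multi-turn player-judge game described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the names `Puzzle`, `play`, `scripted_player`, `toy_judge_answer`, and `toy_judge_final` are hypothetical, the `FINAL:` prefix and the yes/no answer vocabulary are assumptions, and the example puzzle is a well-known situation puzzle that is not taken from SPLAT. In a real setup the player and judge would be LLM calls; here they are deterministic stand-ins so the control flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (player question, judge answer) pairs


@dataclass
class Puzzle:
    story: str      # incomplete story shown to the player
    reference: str  # full reference scenario, visible only to the judge


def play(puzzle: Puzzle,
         player: Callable[[str, History], str],
         judge_answer: Callable[[str, str], str],
         judge_final: Callable[[str, str], bool],
         max_turns: int = 10):
    """Run one game and return (solved, turns_used, history)."""
    history: History = []
    for turn in range(1, max_turns + 1):
        message = player(puzzle.story, history)
        if message.startswith("FINAL:"):
            # The player commits to a scenario; the judge compares it
            # with the reference scenario.
            guess = message[len("FINAL:"):].strip()
            return judge_final(puzzle.reference, guess), turn, history
        # Otherwise the judge answers the question from the reference.
        history.append((message, judge_answer(puzzle.reference, message)))
    return False, max_turns, history  # turn budget exhausted


# --- Toy stand-ins (assumptions; a real setup would query LLMs) ---

def toy_judge_answer(reference: str, question: str) -> str:
    # Crude keyword rule: "yes" if a longer word of the question
    # occurs in the reference scenario, otherwise "no".
    words = [w.strip("?.,!").lower() for w in question.split()]
    hit = any(len(w) > 4 and w in reference.lower() for w in words)
    return "yes" if hit else "no"


def toy_judge_final(reference: str, guess: str) -> bool:
    # Puzzle-specific shortcut: accept any guess mentioning the key fact.
    return "hiccup" in guess.lower()


def scripted_player(story: str, history: History) -> str:
    script = [
        "Was the man thirsty?",
        "Did the man have hiccups?",
        "FINAL: The bartender scared the man to cure his hiccups.",
    ]
    return script[min(len(history), len(script) - 1)]


PUZZLE = Puzzle(
    story="A man asks a bartender for water. The bartender points a gun "
          "at him. The man says thank you and leaves.",
    reference="The man had hiccups, and the bartender scared him with "
              "the gun to cure them.",
)

solved, turns, history = play(PUZZLE, scripted_player,
                              toy_judge_answer, toy_judge_final)
print(solved, turns)  # prints: True 3
```

The point of the design is that the judge only needs to compare questions and guesses against the reference scenario, so it does not have to solve the puzzle itself.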