Abstract: Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions to lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system's ability to significantly improve model performance by optimizing prompts and to evaluate the input dataset.

What problem does this paper attempt to address?

This paper addresses the poor performance of large language models (LLMs) on tasks that require lateral thinking. Specifically, the authors focus on improving the ability of LLMs to solve creative, unconventional problems by optimizing Chain-of-Thought (CoT) prompts.

#### Background and problem description

1. **Limitations of existing research**:
   - Current research mainly focuses on the performance of LLMs on logical reasoning tasks, while relatively little has been done on lateral thinking tasks.
   - Lateral thinking tasks require the model to solve problems creatively and "out of the box", which is a challenge for existing LLMs.
2. **BRAINTEASER shared task**:
   - The BRAINTEASER shared task defines two types of subtasks, sentence puzzles and word puzzles, which are used to test and evaluate a model's lateral thinking ability.
   - The task uses adversarial datasets to prevent the model from relying on memorization rather than reasoning, which leads to poor performance for out-of-the-box models.
3. **Core of the problem**:
   - How to optimize CoT prompts so that LLMs can better understand and solve problems requiring creative thinking.
   - How to use iterative prompt optimization, combined with human evaluation, to identify the specific challenges the model encounters in its reasoning and to improve the prompts in a targeted manner.

#### Proposed solutions

The authors propose a method for iteratively optimizing CoT prompts, which consists of the following steps:

1. **Randomly sample training data and generate initial CoT prompts**:
   - Randomly select samples from the training set and use them to generate initial CoT prompts.
2. **Identify different categories in output reasoning and divide training data**:
   - Divide the training data into categories according to the type of reasoning the model outputs.
3. **Independent human evaluation**:
   - Conduct independent human evaluation on the data in each category to identify specific reasoning challenges.
4. **Develop new CoT prompts based on evaluation results**:
   - Use the results of the human evaluation to develop new CoT prompts, targeting in particular the answer options that are likely to mislead the model.
5. **Optional data collection / synthesis improvement**:
   - Identify deficiencies in the data and provide guidance for future data collection and synthesis.

With this method, the authors not only improve model performance on the adversarial datasets, but also provide insights for the creation of future datasets.
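The iterative loop in steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Puzzle` type, the `optimize_prompt` function, and the `model`, `categorize`, and `review` callables are all hypothetical names introduced here, the initial prompt is passed in rather than generated from the sample, and the human evaluation step is represented by a callback that returns a line of guidance for a category (or an empty string when no change is needed).

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Puzzle:
    question: str
    choices: List[str]
    answer: int  # index of the correct choice


# Hypothetical interfaces; none of these names come from the paper.
Model = Callable[[str, Puzzle], str]        # (prompt, puzzle) -> reasoning text
Categorizer = Callable[[Puzzle, str], str]  # (puzzle, reasoning) -> category label
Reviewer = Callable[[str, List[str]], str]  # (category, reasonings) -> guidance or ""


def optimize_prompt(
    train: List[Puzzle],
    model: Model,
    categorize: Categorizer,
    review: Reviewer,
    initial_prompt: str,
    rounds: int = 2,
    sample_size: int = 4,
    seed: int = 0,
) -> str:
    """Refine a CoT prompt over several rounds of sampled evaluation."""
    rng = random.Random(seed)
    prompt = initial_prompt
    for _ in range(rounds):
        # Step 1: randomly sample training puzzles for this round.
        sample = rng.sample(train, min(sample_size, len(train)))

        # Step 2: run the current prompt and group outputs by reasoning category.
        by_category: Dict[str, List[str]] = defaultdict(list)
        for puzzle in sample:
            reasoning = model(prompt, puzzle)
            by_category[categorize(puzzle, reasoning)].append(reasoning)

        # Step 3: a reviewer (a human, in the described system) inspects each
        # category independently and returns guidance for the prompt.
        notes = [review(cat, outs) for cat, outs in sorted(by_category.items())]

        # Step 4: fold any new guidance into the prompt; stop if there is none.
        new_notes = [n for n in notes if n and n not in prompt]
        if not new_notes:
            break
        prompt = prompt + "\n" + "\n".join(new_notes)
    return prompt
```

In practice `model` would wrap an LLM call and `review` would collect written feedback from annotators; both are kept as plain callables here so the control flow can be exercised with simple stand-ins.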

Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization
