Abstract: Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.

What problem does this paper attempt to address?

The core issue this paper addresses is the evaluation and generation of story-level analogies. Specifically, the researchers constructed a large-scale story-level analogy corpus named StoryAnalogy, which contains 24,000 story pairs from different domains. The pairs were annotated under the extended Structure-Mapping Theory (SMT) for two measures: entity similarity and relational similarity. Using this corpus, the researchers designed a series of tests that evaluate, for the first time, the ability to identify and generate story-level analogies.

### Main Issues:

1. **Evaluating the capabilities of existing models**: The study found that existing sentence embedding models and large language models (such as ChatGPT and LLaMa) perform poorly at identifying story analogies. ChatGPT's accuracy on multiple-choice questions was only around 30%, far below the human accuracy of over 85%.
2. **Generating high-quality analogies**: The researchers also found that data from StoryAnalogy can improve the quality of analogies generated by large language models. A fine-tuned FlanT5-xxl model achieved analogy generation performance comparable to zero-shot ChatGPT.

### Solutions:

- **Constructing a large-scale corpus**: The researchers collected a large number of story pairs from multiple domains (including scientific scripts, social commonsense stories, word-level analogies, and knowledge graph triples) and obtained human annotations of entity similarity and relational similarity through crowdsourcing.
- **Extending the Structure-Mapping Theory**: The researchers extended SMT to the story level and proposed an analogy scoring method based on entity similarity and relational similarity to quantify the degree of analogy between stories.
- **Evaluating model performance**: Through several evaluation settings (including STS-style evaluation and multiple-choice question evaluation), the researchers systematically assessed the performance of existing models on the story analogy task and identified their shortcomings.
- **Improving model performance**: The researchers significantly improved the performance of baseline models on analogy identification and generation tasks through fine-tuning and few-shot learning.

### Significance:

- **Advancing analogy understanding research**: This study provides data and evaluation benchmarks for the understanding and generation of story analogies, which helps advance further research in related fields.
- **Enhancing model capabilities**: The results show that fine-tuning on targeted data can improve model performance on complex cognitive tasks.

In summary, this paper constructs a large-scale story analogy corpus and a systematic evaluation, uses them to reveal the shortcomings of existing models on the story analogy task, and proposes methods for improvement, providing a foundation for future research.
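The multiple-choice evaluation mentioned above can be illustrated with a minimal sketch. It is not the paper's evaluation code: the `embed` function is a toy bag-of-words stand-in for a real sentence embedding model, and the example stories, the helper names (`choose`, `accuracy`), and the question format are all assumptions made for illustration. The idea is only that a model scores each candidate against the query story, picks the highest-scoring one, and accuracy is the fraction of questions answered correctly.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().replace(".", "").split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def choose(query, candidates):
    """Return the index of the candidate most similar to the query."""
    q = embed(query)
    scores = [cosine(q, embed(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(questions):
    """Fraction of questions where the chosen candidate is the labeled answer."""
    correct = sum(choose(q["query"], q["candidates"]) == q["answer"] for q in questions)
    return correct / len(questions)


# Hypothetical item: candidate 0 shares surface words with the query,
# candidate 1 shares its relational structure (an intruder exploits a host).
questions = [
    {
        "query": "The virus invades the cell. The cell produces more viruses.",
        "candidates": [
            "The virus is studied in the lab. The cell is stained.",
            "The burglar breaks into the house. The house is looted.",
        ],
        "answer": 1,
    },
]

print(accuracy(questions))
```

On this toy item the lexical baseline picks the distractor, because word overlap rewards shared entities rather than shared relations. That is the kind of failure the multiple-choice setting is meant to expose; a real evaluation would swap `embed` for an actual encoder or an LLM prompt.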

StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding
