StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding

Cheng Jiayang,Lin Qiu,Tsz Ho Chan,Tianqing Fang,Weiqi Wang,Chunkit Chan,Dongyu Ru,Qipeng Guo,Hongming Zhang,Yangqiu Song,Yue Zhang,Zheng Zhang
2023-10-23
Abstract: Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the evaluation and generation of story-level analogies. The researchers constructed a large-scale story-level analogy corpus, StoryAnalogy, which contains 24,000 story pairs from different domains. Each pair is annotated for entity similarity and relational similarity, the two similarities defined by the extended Structure-Mapping Theory (SMT). Using this corpus, the researchers designed a series of tests that evaluate, for the first time, the ability to identify and generate story-level analogies.

### Main Issues:
1. **Evaluating the capabilities of existing models**: The study found that sentence embedding models and large language models (such as ChatGPT and LLaMa) perform poorly at identifying story analogies. ChatGPT's accuracy on multiple-choice questions was only around 30%, far below the human accuracy of over 85%.
2. **Generating high-quality analogies**: The researchers also found that data from StoryAnalogy can improve the quality of analogies generated by large language models. A fine-tuned FlanT5-xxl model achieved analogy generation performance comparable to zero-shot ChatGPT.

### Solutions:
- **Constructing a large-scale corpus**: The researchers collected story pairs from multiple domains (including scientific scripts, social commonsense stories, word-level analogies, and knowledge graph triples) and obtained human annotations of entity similarity and relational similarity through crowdsourcing.
- **Extending the Structure-Mapping Theory**: The researchers extended SMT to the story level and proposed an analogy scoring method based on entity similarity and relational similarity to quantify the degree of analogy between stories.
- **Evaluating model performance**: Using STS-style evaluation and multiple-choice question evaluation, the researchers systematically assessed existing models on the story analogy task and identified their shortcomings.
- **Improving model performance**: The researchers significantly improved the performance of baseline models on analogy identification and generation tasks through fine-tuning and few-shot learning.

### Significance:
- **Advancing analogy understanding research**: This study provides data and evaluation benchmarks for the understanding and generation of story analogies, which supports further research in related fields.
- **Enhancing model capabilities**: The results show that fine-tuning on task-specific data can improve model performance on complex cognitive tasks.

In summary, by constructing a large-scale story analogy corpus and a systematic set of evaluations, this paper reveals the shortcomings of existing models on the story analogy task and proposes methods for improvement, providing a foundation for future research.
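The two evaluation styles mentioned above (STS-style evaluation and multiple-choice questions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the bag-of-words `embed` function is a toy stand-in for a real sentence embedding model, the STS-style metric is assumed to be Spearman rank correlation against human similarity annotations, and all function names and the question format (`query`, `candidates`, `answer`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy bag-of-words 'embedding' (assumed stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pick_analogy(query, candidates):
    """Return the index of the candidate story most similar to the query."""
    q = embed(query)
    scores = [cosine(q, embed(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def multiple_choice_accuracy(questions):
    """Fraction of questions where the top-scoring candidate is the labeled answer."""
    correct = sum(
        pick_analogy(q["query"], q["candidates"]) == q["answer"] for q in questions
    )
    return correct / len(questions)


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    # Hypothetical toy question: candidate 0 is the analogy, candidate 1 is a
    # distractor that shares surface words with the query.
    questions = [
        {
            "query": "water flows through a pipe",
            "candidates": [
                "electricity moves through a wire",
                "water flows through the rocks",
            ],
            "answer": 0,
        }
    ]
    print("multiple-choice accuracy:", multiple_choice_accuracy(questions))
    # STS-style: correlate model similarities with (made-up) human scores.
    print("spearman:", spearman([0.2, 0.5, 0.9], [1.0, 2.0, 3.0]))
```

In the toy question, the distractor shares more words with the query than the true analogy does, so a purely lexical scorer prefers it. This is one intuition for why surface-similarity methods can struggle on analogy identification, where the correct pair has similar relations but different entities.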