Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown
2024-07-12
Abstract: We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper evaluates how well large language models (LLMs) summarize short stories, particularly stories that are lengthy, carry nuanced subtext, or follow non-linear timelines. The research focuses on the following aspects:

1. **Faithfulness**: Whether the model conveys the details of the story accurately, without introducing errors.
2. **Specificity**: Whether the model captures specific details of the story rather than staying vague.
3. **Thematic Analysis**: Whether the model correctly understands and explains the themes and deeper meanings of the story.
4. **Coherence**: Whether the generated summaries are coherent, fluent, and easy to read.

To keep the evaluation fair and accurate, the researchers worked directly with authors, using stories that had not been shared online (and were therefore unseen by the models) and asking the authors themselves to evaluate the generated summaries. Through quantitative and qualitative analysis, the paper compares the performance of GPT-4, Claude-2.1, and LLama-2-70B.

### Main Findings

1. **Model Performance**:
   - GPT-4 and Claude-2.1 could produce excellent summaries, but all three models made faithfulness mistakes in over 50% of their summaries.
   - LLama-2-70B performed significantly worse across all attributes.
2. **Specific Issues**:
   - The models struggled with specificity and with interpreting difficult subtext.
   - Unreliable narrators and complex storylines posed challenges to the models' summarization abilities.
   - A considerable portion of the summaries offered analysis that was unsupported by the story, or completely misunderstood the emotions and actions in it.
3. **Automatic Evaluation Metrics**:
   - LLM ratings and other automatic metrics (such as ROUGE and BERTScore) correlated poorly with the authors' ratings and could not replace human expert judgment.
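To make the claim about automatic metrics concrete, the sketch below shows one way to measure how well a metric agrees with writer ratings, using Spearman rank correlation. This summary does not state which correlation statistic the paper uses, so Spearman is an assumption here. The function names and the toy numbers are hypothetical and are not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy, made-up values for five summaries (NOT results from the paper).
writer_ratings = [4, 2, 5, 1, 3]                 # hypothetical writer scores
metric_scores = [0.31, 0.35, 0.30, 0.33, 0.36]   # hypothetical metric scores

rho = spearman(writer_ratings, metric_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 would mean the metric ranks summaries the same way the writers do. A value near 0 or below means the metric's ranking tells us little about the writers' judgments, which is the pattern the paper reports for automatic metrics.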
### Conclusion

The paper highlights the limitations of large language models in handling complex narrative texts, especially in interpreting subtext and carrying out thematic analysis. Although these models can generate high-quality summaries in some cases, they need further improvement in accuracy and reliability. The research also shows that working with the writers themselves is a key way to obtain valid evaluations of summary quality.