Abstract: We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper evaluates how well large language models (LLMs) summarize short stories, particularly those with complex plots, non-linear timelines, or nuanced subtext. The research focuses on the following aspects:

1. **Faithfulness**: Whether the model accurately conveys the details of the story without introducing errors.
2. **Specificity**: Whether the model captures specific details of the story rather than being vague.
3. **Thematic Analysis**: Whether the model correctly understands and explains the themes and deeper meanings of the story.
4. **Coherence**: Whether the summaries generated by the model are coherent, fluent, and easy to read.

To keep the evaluation fair and accurate, the researchers worked directly with authors, selecting stories that had not been shared online (and were therefore unseen by the models) and asking the authors themselves to evaluate the generated summaries. Through quantitative and qualitative analysis, the paper compares the performance of GPT-4, Claude-2.1, and Llama-2-70B.

### Main Findings

1. **Model Performance**:
   - GPT-4 and Claude-2.1 produced strong summaries in some cases, but all three models made faithfulness mistakes in over 50% of summaries.
   - Llama-2-70B performed significantly worse across all attributes.
2. **Specific Issues**:
   - The models struggled with specificity and with interpreting difficult subtext.
   - Unreliable narrators and complex storylines posed challenges to the models' summarization abilities.
   - A considerable portion of the summaries offered analysis that the story did not support, or completely misread the emotions and actions in the story.
3. **Automatic Evaluation Metrics**:
   - Existing automatic evaluation metrics (such as ROUGE and BERTScore) and LLM-based ratings correlated poorly with the authors' ratings and could not replace human expert judgment.
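
To make the last finding concrete, the sketch below shows one common way to check whether an automatic metric tracks human ratings: a rank correlation between the two sets of scores. It assumes Spearman's rank correlation; the statistic the paper actually uses is not stated in this summary. The function names and all numbers are hypothetical placeholders for illustration and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy values, NOT results from the paper:
# one author rating and one automatic metric score per summary.
author_ratings = [4, 2, 3, 1, 4]
metric_scores = [0.31, 0.35, 0.28, 0.30, 0.29]

rho = spearman(author_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 0 would mean the metric's ordering of summaries carries little information about how the authors ordered them, which is the kind of mismatch the finding describes.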
### Conclusion

The paper highlights the limitations of large language models in handling complex narrative texts, especially in interpreting subtext and performing thematic analysis. Although these models can generate high-quality summaries in some cases, they need further improvement in accuracy and reliability. The research also shows that working with the writers themselves is key to obtaining valid evaluations of summary quality.

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
