Abstract: We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses story summarization for TV series: extracting multiple key story points from long videos. The authors propose an approach that uses recap segments from TV series to generate multimodal story summaries.

### Main Contributions

1. **Proposing the Story Summarization Task**:
   - The task requires identifying and extracting multiple key story points from narrative content, which makes it a challenging multimodal long-video understanding problem.
2. **Using TV Series Recap Segments**:
   - The authors show how recap segments from TV series can be used for video understanding and apply this to story summarization. They introduce a new dataset, PlotSnap, which contains rich recap segments from two crime thriller series, "24" and "Prison Break."
3. **Proposing a New Hierarchical Model, TaleSumm**:
   - The model has shot-level and dialogue-level encoders whose outputs feed an episode-level Transformer. It processes entire episodes while remaining lightweight enough to train on consumer-grade GPUs.
4. **Extensive Evaluation**:
   - The authors report thorough evaluations, including ablation studies that validate design choices. TaleSumm achieved state-of-the-art performance on the PlotSnap dataset and performed well on video summarization benchmarks. They also evaluated generalization across seasons and across TV series, as well as the consistency of labels obtained from multiple sources.

### Method Overview

1. **Problem Definition**:
   - The goal is to extract a multimodal story summary (video and text) from a given episode, typically around 40 minutes long and covering multiple key events. This is done by assigning an importance score to each video shot and each dialogue utterance.
2. **Feature Extraction**:
   - Three pre-trained visual backbone networks (DenseNet, MViT, and OpenAI CLIP) capture visual diversity in shots.
   - For dialogues, a fine-tuned language model (such as RoBERTa-large) computes context-aware word-level features.
3. **Shot and Dialogue Representation**:
   - Frame-level features are aggregated through an attention mechanism into compact shot representations.
   - Dialogue representations are obtained by simple mean pooling over word-level features.
4. **Hierarchical Model TaleSumm**:
   - First level: shot (dialogue) representations are extracted from frame-level (word-level) interactions.
   - Second level: a Transformer encoder captures cross-modal interactions across the entire episode.
5. **Training and Inference**:
   - The model is trained end-to-end with a binary cross-entropy (BCE) loss, with class imbalance taken into account.
   - At test time, the model outputs an importance score for each video shot and dialogue utterance.

### Experiments and Analysis

1. **Data Splits**:
   - Three settings were adopted: (i) intra-season cross-validation within "24"; (ii) cross-season generalization evaluation within "24"; (iii) cross-series transfer evaluation from "24" to "Prison Break."
2. **Evaluation Metrics**:
   - Average Precision (AP, area under the PR curve) was used to compare predicted importance scores with ground-truth labels.
3. **Experimental Results**:
   - TaleSumm achieved state-of-the-art performance on the PlotSnap dataset and performed well on video summarization benchmarks. Its ability to generalize across seasons and across series was also validated.

### Conclusion

This paper proposes a new multimodal story summarization task built on TV series recap segments and develops an effective hierarchical model, TaleSumm. These contributions advance research in multimodal long-video understanding and offer tools that could be useful in practical applications.
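The pooling steps and the evaluation metric from the Method Overview can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`attention_pool`, `mean_pool`, `average_precision`) and the single `query` vector used to score frames are assumptions made for illustration, and the episode-level Transformer is left out entirely.

```python
import math


def attention_pool(frames, query):
    """Collapse per-frame feature vectors into one shot vector.

    Each frame is scored by its dot product with `query` (a hypothetical
    learned vector), the scores are passed through a softmax, and the
    frames are averaged with those weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]


def mean_pool(tokens):
    """Collapse per-word feature vectors into one utterance vector by averaging."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def average_precision(scores, labels):
    """Average Precision: mean of precision@k over the ranks k of positive items.

    `scores` are predicted importance scores; `labels` are 1 for shots or
    utterances that belong to the summary and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

In this sketch, `attention_pool` lets informative frames dominate a shot's representation, while `mean_pool` treats every word equally, matching the simpler treatment the summary describes for dialogue. `average_precision` depends only on the ranking of the scores, which suits a setting where only a few shots per episode are labeled as important.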

"Previously on ..." From Recaps to Story Summarization
