Abstract: Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be public at https://github.com/showlab/MovieSeq.

What problem does this paper attempt to address?

The paper addresses how to understand and process the complex multimodal context in narrative videos such as movies and TV dramas. It proposes MovieSeq, a multimodal language model that handles the diverse demands of video understanding by representing videos as interleaved multimodal sequences (images, subtitles, plots, and histories).

### Main problems

1. **Situational dialogue understanding**: The model must process visual and audio inputs (such as dialogues) together and correlate them to generate a comprehensive description.
2. **Event dependence**: The model must identify causal relationships between events in the video in order to construct a coherent narrative.
3. **Utilization of external knowledge**: The model must integrate external information (such as character photos and plot synopses) to improve its understanding of the video content.

### Specific challenges

- **Diverse context**: Videos carry information in several modalities (images, subtitles, plots, histories), so a unified framework is needed to process them.
- **Task diversity**: Video understanding spans multiple tasks, such as classification, description, retrieval, and question answering, and each places different requirements on the model.
- **Lack of instruction-following data**: Most existing datasets target a specific task and do not provide instruction-following data for complex video understanding.

### Solutions

- **Interleaved multimodal sequence**: The video and its related context (images, subtitles, plots, and so on) are combined into a single interleaved multimodal sequence that serves as the model input.
- **Instruction tuning**: Instruction tuning teaches the model to produce the appropriate output for each kind of context.
- **Multi-task training**: Multiple tasks are trained jointly in one general model to improve its generalization.

### Experimental verification

The paper reports experiments on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA), covering video classification, audio description, video-text retrieval, video captioning, and video question answering. According to the reported results, MovieSeq performs well across these tasks, and the improvement is most pronounced when multiple contexts are combined. The approach therefore offers a way to understand the multimodal information in narrative videos more comprehensively.
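The interleaved multimodal sequence described above can be illustrated with a small sketch. This is not the MovieSeq implementation: the `Segment` class, the `<image>` and `<video>` placeholder tokens, the function names, and the ordering of the contexts are all assumptions made for illustration. It only shows the general idea of placing character photos next to their names, followed by the clip, the subtitles, and an instruction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical placeholder tokens; a real model defines its own special tokens.
IMAGE_TOKEN = "<image>"
VIDEO_TOKEN = "<video>"


@dataclass
class Segment:
    kind: str     # "text", "image", or "video"
    content: str  # raw text, or a path/identifier for visual media


def build_interleaved_sequence(
    characters: List[Tuple[str, str]],  # (name, photo path)
    subtitles: List[Tuple[str, str]],   # (speaker, dialogue line)
    clip_path: str,
    instruction: str,
    plot: Optional[str] = None,
) -> List[Segment]:
    """Interleave text and visual context into one ordered sequence (assumed order)."""
    segments: List[Segment] = []
    if plot:
        segments.append(Segment("text", f"Plot: {plot}"))
    # Each character name is immediately followed by that character's photo.
    for name, photo_path in characters:
        segments.append(Segment("text", f"{name}:"))
        segments.append(Segment("image", photo_path))
    segments.append(Segment("video", clip_path))
    for speaker, line in subtitles:
        segments.append(Segment("text", f'{speaker} says: "{line}"'))
    segments.append(Segment("text", instruction))
    return segments


def render_prompt(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Return a text prompt with placeholders plus the media list in matching order."""
    parts: List[str] = []
    media: List[str] = []
    for seg in segments:
        if seg.kind == "text":
            parts.append(seg.content)
        elif seg.kind == "image":
            parts.append(IMAGE_TOKEN)
            media.append(seg.content)
        elif seg.kind == "video":
            parts.append(VIDEO_TOKEN)
            media.append(seg.content)
        else:
            raise ValueError(f"unknown segment kind: {seg.kind}")
    return " ".join(parts), media
```

In a sketch like this, a downstream model would replace each placeholder with the visual features of the corresponding entry in the media list, which is why the two outputs are kept in the same order.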

Learning Video Context as Interleaved Multimodal Sequences
