Kevin Qinghong Lin,Pengchuan Zhang,Difei Gao,Xide Xia,Joya Chen,Ziteng Gao,Jinheng Xie,Xuhong Xiao,Mike Zheng Shou
Abstract:Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be public at <a class="link-external link-https" href="https://github.com/showlab/MovieSeq" rel="external noopener nofollow">this https URL</a>.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: how to understand and process the complex multi - modal context in narrative videos (such as movies and TV dramas). Specifically, the paper proposes a multi - modal language model named MovieSeq, aiming to flexibly respond to the diverse challenges in video understanding by representing videos as interleaved multi - modal sequences (including images, subtitles, plots and histories).
### Main problems:
1. **Situational dialogue understanding**: The model needs to process visual and audio inputs (such as dialogues) simultaneously and correlate them to generate a comprehensive description.
2. **Event dependence**: Identify the causal relationships between events in the video to help construct a coherent narrative.
3. **Utilization of external knowledge**: Integrate external information (such as character photos, plot synopses, etc.) to enhance the understanding of video content.
### Specific challenges:
- **Diverse context**: Videos contain information in multiple modalities (images, subtitles, plots, histories, etc.), and a unified framework is required to process this information.
- **Task diversity**: Video understanding involves multiple tasks, such as classification, description, retrieval, question - answering, etc., and each task has different requirements for the model.
- **Lack of instruction - following data**: Most of the existing datasets are for specific tasks, and lack instruction - following data for complex video understanding.
### Solutions:
- **Interleaved multi - modal sequence**: Combine the video and its related context (such as images, subtitles, plots, etc.) into an interleaved multi - modal sequence as the input of the model.
- **Instruction tuning**: Through instruction tuning, enable the model to generate corresponding outputs according to different contexts.
- **Multi - task training**: Jointly train multiple tasks in a general model to improve the generalization ability of the model.
### Experimental verification:
The paper conducted experiments on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA), covering multiple tasks such as video classification, audio description, video - text retrieval, video caption generation and video question - answering. The experimental results show that MovieSeq performs excellently in multiple tasks, especially when combining multiple contexts, the performance improvement is significant.
Through these methods, MovieSeq can more comprehensively understand and process the complex multi - modal information in narrative videos, providing a new solution for video understanding.