MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen

2024-06-24

Abstract: The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to synthesize consistent and high-quality video data for instruction tuning. The pipeline is carefully designed to control the style of videos by improving the textual inversion technique with the powerful text generation capability of GPT-4. As the first framework of its kind, our approach stands out for its flexibility and scalability, empowering users to create customized movies with only one description. This makes it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the inadequacy of existing video understanding models on long videos such as movies. Current multimodal models perform well on short video clips, but they face two major challenges when understanding and processing long videos:

1. **Lack of high-quality, diverse video data**: Existing long video datasets are few and their content lacks diversity, which restricts a model's ability to generalize.
2. **High cost and difficulty of data collection and annotation**: Manually collecting or annotating information in long videos (such as movie dialogues) requires a significant amount of human labor and time, making it costly.

To address these challenges, the paper proposes a new framework called **MovieLLM**, which aims to enhance long video understanding by generating consistent, high-quality video data. The framework has three stages:

- **Movie plot generation**: GPT-4 generates diverse movie plots, including themes, styles, characters, and keyframe descriptions.
- **Style fixation process**: Textual inversion fixes the style description into the latent space of a diffusion model, so that generated scenes share a consistent style.
- **Video instruction data generation**: GPT-4's generation capabilities are combined with the style-fixed diffusion model to produce style-consistent keyframes and their corresponding question-answer pairs.

Through these steps, MovieLLM can generate large-scale, high-quality long video datasets, which the paper reports significantly improve multimodal models' ability to understand and process complex video narratives.
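The three stages summarized above can be pictured as a small data flow. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `<movie-style>` placeholder token, the prompt template, and the `render` callback (a stand-in for a style-fixed diffusion model) are all hypothetical, and the GPT-4 and diffusion calls are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class Keyframe:
    description: str  # text description of one key scene
    qa_pairs: List[QAPair] = field(default_factory=list)


@dataclass
class MoviePlot:
    # Fields mirror what the summary says the plot stage produces.
    theme: str
    style: str
    characters: List[str]
    keyframes: List[Keyframe]


# Hypothetical placeholder token whose embedding a textual-inversion
# step would have learned for this movie's style.
STYLE_TOKEN = "<movie-style>"


def styled_prompt(description: str, style_token: str = STYLE_TOKEN) -> str:
    """Attach the learned style token to a keyframe description."""
    return f"{description.strip()}, in the style of {style_token}"


def build_instruction_data(
    plot: MoviePlot, render: Callable[[str], Any]
) -> List[Dict[str, Any]]:
    """Render each keyframe once and pair it with its QA annotations.

    `render` is an assumed callback standing in for the style-fixed
    diffusion model; it receives the styled prompt and returns an image.
    """
    records: List[Dict[str, Any]] = []
    for index, frame in enumerate(plot.keyframes):
        image = render(styled_prompt(frame.description))
        for qa in frame.qa_pairs:
            records.append(
                {
                    "frame_index": index,
                    "image": image,
                    "question": qa.question,
                    "answer": qa.answer,
                }
            )
    return records
```

Passing `render=lambda prompt: prompt` is enough to inspect the prompts that would be sent to an image generator without loading any model.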
