MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen
2024-06-24
Abstract: The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to synthesize consistent, high-quality video data for instruction tuning. The pipeline is carefully designed to control the style of videos by improving the textual inversion technique with the powerful text generation capability of GPT-4. As the first framework to do so, our approach stands out for its flexibility and scalability, empowering users to create customized movies from a single description. This makes it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the inadequacy of existing video understanding models when handling long videos such as movies. While current multimodal models perform well on short video clips, they face two major challenges with long videos:

1. **Lack of high-quality, diverse video data**: Existing long video datasets are few in number and limited in content diversity, which restricts a model's ability to generalize.
2. **High cost and difficulty of data collection and annotation**: Manually collecting or annotating information in long videos (such as movie dialogues) requires a significant amount of human labor and time, making it costly.

To address these challenges, the paper proposes a new framework called **MovieLLM**, which aims to enhance long video understanding by generating consistent, high-quality video data. The method has three stages:

- **Movie plot generation**: Using GPT-4 to generate diverse movie plots, including themes, styles, characters, and keyframe descriptions.
- **Style fixation process**: Using textual inversion to fix the style description into the latent space of a diffusion model, so that generated scenes share a consistent style.
- **Video instruction data generation**: Combining GPT-4's generation capabilities with the style-fixed diffusion model to generate style-consistent keyframes and their corresponding question-answer pairs.

Through these steps, MovieLLM can generate large-scale, high-quality long video datasets, significantly enhancing multimodal models' ability to understand and process complex video narratives.
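
The three stages above can be pictured as a small data flow. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `<movie-style>` placeholder token, and the `render` and `ask` callables are all hypothetical stand-ins. In a real pipeline, `render` would call a diffusion model whose style token was learned through textual inversion, and `ask` would call a language model such as GPT-4; here both are replaced by deterministic stubs so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Keyframe:
    """One keyframe description from a generated plot (hypothetical schema)."""
    description: str


@dataclass
class MoviePlot:
    """Stage 1 output: theme, style, characters, and keyframe descriptions."""
    theme: str
    style: str
    characters: List[str]
    keyframes: List[Keyframe] = field(default_factory=list)


# Hypothetical placeholder token standing in for a learned style embedding.
STYLE_TOKEN = "<movie-style>"


def frame_prompt(frame: Keyframe, style_token: str = STYLE_TOKEN) -> str:
    """Stage 2 idea: every prompt reuses the same style token for consistency."""
    return f"{frame.description}, in the style of {style_token}"


def build_instruction_data(
    plot: MoviePlot,
    render: Callable[[str], str],
    ask: Callable[[MoviePlot, Keyframe], List[Tuple[str, str]]],
) -> List[Dict[str, object]]:
    """Stage 3 idea: pair each rendered keyframe with question-answer pairs."""
    records: List[Dict[str, object]] = []
    for index, frame in enumerate(plot.keyframes):
        image = render(frame_prompt(frame))
        for question, answer in ask(plot, frame):
            records.append(
                {
                    "frame_index": index,
                    "image": image,
                    "question": question,
                    "answer": answer,
                }
            )
    return records


def fake_render(prompt: str) -> str:
    """Stub image generator: returns a string instead of pixels."""
    return f"image({prompt})"


def fake_ask(plot: MoviePlot, frame: Keyframe) -> List[Tuple[str, str]]:
    """Stub question generator: one fixed question per keyframe."""
    return [("What is happening in this scene?", frame.description)]


plot = MoviePlot(
    theme="heist",
    style="noir",
    characters=["Ava"],
    keyframes=[
        Keyframe("Ava studies a vault blueprint"),
        Keyframe("Ava slips past a guard"),
    ],
)
records = build_instruction_data(plot, fake_render, fake_ask)
```

Passing the generators in as callables keeps the record-building logic independent of any particular model API, so the stubs can be swapped for real model calls without changing the loop.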