Abstract: This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, in video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes, which must be effectively evaluated and corrected. In the context of probabilistic models like diffusion, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.
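The selection step described in the abstract (sample several takes, score each one automatically, keep the best) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the criterion names are taken from the abstract, while the equal weights, the `scorer` callback, and the function names `combined_score` and `select_best_take` are assumptions made for the example.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

Take = TypeVar("Take")

# Criteria named in the abstract. The equal weights are purely illustrative
# (an assumption); the source does not state how the criteria are combined.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "character_consistency": 0.25,
    "artistic_coherence": 0.25,
    "aesthetic_quality": 0.25,
    "prompt_alignment": 0.25,
}


def combined_score(scores: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-criterion scores (hypothetical aggregation)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def select_best_take(takes: Sequence[Take],
                     scorer: Callable[[Take], Dict[str, float]],
                     weights: Dict[str, float] = DEFAULT_WEIGHTS) -> Tuple[int, float]:
    """Return (index, score) of the highest-scoring take.

    `scorer` is a hypothetical callback that maps one generated take to a
    dict of per-criterion scores; any automated metric could sit behind it.
    """
    if not takes:
        raise ValueError("at least one take is required")
    scored = [combined_score(scorer(take), weights) for take in takes]
    best = max(range(len(takes)), key=scored.__getitem__)
    return best, scored[best]
```

In use, `takes` would hold the repeated samples drawn for one scene, and the returned index replaces the manual "pick the best shot" step.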

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to ensure, when generating multi-scene videos, the quality of the video content across several aspects (character consistency, artistic coherence, aesthetic quality, and alignment with the intended prompt), and how to effectively evaluate and correct problems such as distortion or unexpected changes introduced by character movement across frames. Specifically, traditional short-video generation methods struggle to meet the more complex evaluation criteria required for multi-scene video generation based on a continuous scenario.

### Specific manifestations of the problem

1. **Character consistency**: Ensure that the appearance and behavior of characters remain consistent across scenes.
2. **Artistic coherence**: Ensure that the artistic style and visual effects of the video stay consistent across scenes.
3. **Aesthetic quality**: Ensure that the generated video has high aesthetic value.
4. **Alignment with the prompt**: The generated content should match the input text prompt as closely as possible.
5. **Cross-frame character movement problems**: In video generation, the movement of characters between frames may introduce distortion or other unexpected changes.

### Solutions

To solve these problems, the paper proposes a scoring benchmark named **MSG (Multi-Scene Video Generation)**. The method achieves a more objective and efficient evaluation through an automated scoring process, which is then used to select high-quality multi-scene videos. Specifically, MSG contains two main components:

1. **Backward and Forward Frame Reference (BFFR)**:
   - Consider the previous and subsequent frames when processing each frame, to enhance spatial details and maintain short-term temporal consistency.
   - Mathematically represented as: \[ \hat{I}_t = F(I_{t-1}, I_t, I_{t+1}; \theta_F) \] where \(\hat{I}_t\) is the enhanced frame, and \(I_{t-1}\) and \(I_{t+1}\) are the previous and subsequent frames respectively.
2. **Backward Scene Reference (BSR)**:
   - When the scene changes, use a "backtracking" mechanism to reference the key frames of the previous scene, ensuring a smooth transition and long-term consistency.
   - Mathematically represented as: \[ \hat{I}_t = \mathrm{BSR}(I_{\text{prev}}, I_t; \theta_B) \] where \(I_{\text{prev}}\) is the key frame of the previous scene.

### Loss function

To ensure high-quality video generation, the loss function includes the following parts:

- **Mean-squared-error (MSE) loss**: Used to ensure spatial fidelity.
- **Temporal consistency loss**: Encourages smooth transitions between frames.
- **Inter-scene consistency loss**: Penalizes differences between key frames of adjacent scenes.

### Experimental results

Although the experiments were designed to verify the effectiveness of the proposed method, they encountered problems in practice and the results are inconclusive. The authors therefore conclude that further research and re-experimentation are required to overcome these obstacles and to improve the intra-frame and inter-scene temporal consistency of the video generation model.

### Summary

The main contribution of this paper is a new method for comprehensively evaluating multi-scene video generation. It combines short-term and long-term temporal consistency techniques with the aim of improving the quality of the generated video. However, the experimental results have not fully verified its effectiveness, and future work will focus on improving and optimizing the method.
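To make the two reference mechanisms above concrete, the sketch below shows only which frames each one would receive as input; it does not implement the learned functions \(F\) or BSR. The boundary clamping in `bffr_windows` and the choice of the last frame of the previous scene as its key frame in `bsr_pairs` are assumptions, as are both function names.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bffr_windows(frames: Sequence[T]) -> Iterator[Tuple[T, T, T]]:
    """Yield (I_{t-1}, I_t, I_{t+1}) for every frame index t.

    At the first and last frame the missing neighbour is replaced by the
    frame itself (clamping). This boundary rule is an assumption.
    """
    n = len(frames)
    for t in range(n):
        yield frames[max(t - 1, 0)], frames[t], frames[min(t + 1, n - 1)]


def bsr_pairs(scenes: Sequence[Sequence[T]]) -> Iterator[Tuple[T, T]]:
    """Yield (I_prev, I_t) at each scene change.

    I_prev is taken to be the last frame of the previous scene (an assumed
    key-frame choice), and I_t is the first frame of the new scene.
    """
    for prev_scene, scene in zip(scenes, scenes[1:]):
        if not prev_scene or not scene:
            raise ValueError("scenes must be non-empty")
        yield prev_scene[-1], scene[0]
```

Each yielded tuple corresponds to the argument list of one call to \(F\) or BSR in the equations above.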
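The three loss terms listed above are named but not given in closed form, so the following is only one plausible reading. Frames are assumed to be flattened to lists of pixel intensities, the temporal term is a plain difference between consecutive predicted frames, and the weights `lambda_t` and `lambda_s` are hypothetical.

```python
from typing import Sequence

Frame = Sequence[float]  # one frame flattened to pixel intensities (assumption)


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of equal size."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def spatial_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """MSE term: average per-frame error against the target frames."""
    return sum(mse(p, t) for p, t in zip(pred, target)) / len(pred)


def temporal_loss(pred: Sequence[Frame]) -> float:
    """Temporal term: average change between consecutive predicted frames."""
    if len(pred) < 2:
        return 0.0
    return sum(mse(pred[t], pred[t - 1]) for t in range(1, len(pred))) / (len(pred) - 1)


def inter_scene_loss(key_frames: Sequence[Frame]) -> float:
    """Inter-scene term: average difference between adjacent scenes' key frames."""
    if len(key_frames) < 2:
        return 0.0
    return sum(mse(key_frames[s], key_frames[s - 1])
               for s in range(1, len(key_frames))) / (len(key_frames) - 1)


def total_loss(pred: Sequence[Frame], target: Sequence[Frame],
               key_frames: Sequence[Frame],
               lambda_t: float = 0.1, lambda_s: float = 0.1) -> float:
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (spatial_loss(pred, target)
            + lambda_t * temporal_loss(pred)
            + lambda_s * inter_scene_loss(key_frames))
```

A real training setup would compute these on tensors and might compare motion-compensated frames in the temporal term; the plain frame difference here is a simplification chosen to keep the example self-contained.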

MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation
