Abstract: Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose **S$^2$AG-Vid**, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S$^2$AG-Vid initially applies a spatial position-based cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.

What problem does this paper attempt to address?

This paper addresses a weakness of text-to-video (T2V) generation models: when a scene contains multiple objects performing different actions, existing models struggle to align each subject in the text description with its corresponding action. Two failure modes stand out:

1. **Mismatch in the number of subjects**: The number of subjects in the generated video does not match the text description. For example, given the prompt "A man is walking and a dog is running", the generated video may contain one man and two or more dogs, or only a man.
2. **Incorrect action binding**: Even when the number of subjects is correct, actions may be assigned to the wrong subjects. For example, given the prompt "A man is skateboarding and a dog is sitting", the generated video may show both the man and the dog skateboarding, or the man standing while the dog skateboards.

To solve these problems, the authors propose S$^2$AG-Vid, which improves multi-subject, multi-action alignment through space- and syntax-aware attention guidance. It applies cross-attention constraints based on spatial positions in the early stage of the denoising process, then introduces a syntax-guided contrastive constraint in the subsequent stage to strengthen the association between verbs and their nouns, thereby improving subject-action consistency.
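
The two-stage schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`pick_constraint`, `guided_denoise`), the step thresholds, and the gradient-step update of the latent are all assumptions made for the example. The summary only states that the spatial constraint acts in the early denoising steps and the syntax constraint in the subsequent ones.

```python
from typing import Callable, Dict, Optional

import numpy as np


def pick_constraint(step: int, spatial_steps: int, guided_steps: int) -> Optional[str]:
    """Return which constraint is active at a given denoising step.

    Assumption: steps [0, spatial_steps) use the spatial constraint,
    steps [spatial_steps, guided_steps) use the syntax constraint,
    and later steps run without guidance.
    """
    if step < spatial_steps:
        return "spatial"
    if step < guided_steps:
        return "syntax"
    return None


def guided_denoise(
    latent: np.ndarray,
    num_steps: int,
    spatial_steps: int,
    guided_steps: int,
    grad_fns: Dict[str, Callable[[np.ndarray], np.ndarray]],
    denoise_fn: Callable[[np.ndarray, int], np.ndarray],
    step_size: float = 0.1,
) -> np.ndarray:
    """Toy denoising loop with inference-time guidance.

    grad_fns maps a constraint name to a hypothetical function returning
    the gradient of that constraint's loss with respect to the latent.
    denoise_fn stands in for one step of the (frozen) T2V model.
    """
    for step in range(num_steps):
        name = pick_constraint(step, spatial_steps, guided_steps)
        if name is not None:
            # Nudge the latent downhill on the active loss before denoising.
            latent = latent - step_size * grad_fns[name](latent)
        latent = denoise_fn(latent, step)
    return latent
```

Because only the latent is updated, the model weights stay untouched, which matches the training-free framing in the abstract.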

### Formula summary

- **Space-aware constraints**:

\[ L_{fg}=\frac{1}{F}\sum_{i,j\in S^*}\sum_{f\in F}\left(1 - \frac{A_f^i\cdot M_f^i}{A_f^i}\right)^2 \]

\[ L_{bg}=\frac{1}{F}\sum_{i,j\in S^*}\sum_{f\in F}\left(\frac{A_f^i\cdot(1 - M_f^i)}{A_f^i}\right)^2 \]

\[ L_{sp}=\lambda_{fg}L_{fg}+\lambda_{bg}L_{bg} \]

- **Syntax-aware constraints**:

\[ L_{pos}(s_i^*)=\frac{1}{F}\sum_{f\in F}\text{dist}(A_f^i,A_f^j) \]

\[ L_{neg}(s_i^*,U_i)=\frac{1}{F}\sum_{u\in U_i}\sum_{f\in F}\text{dist}(A_f^i,A_f^u) \]

\[ L_{syt}=\sum_{s_i^*\in S^*}\frac{L_{pos}(s_i^*)}{L_{pos}(s_i^*)+L_{neg}(s_i^*,U_i)} \]

These formulas are used, respectively, to ensure that the subjects are concentrated in the specified area (space-aware constraints) and to enhance the association between verbs and nouns (syntax-aware constraints). Through these methods, S$^2$AG-Vid significantly improves the alignment quality of multiple subjects and multiple actions.
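
To make the loss terms concrete, here is a small NumPy sketch under stated assumptions. It reads $A_f^i$ as the cross-attention map of token $i$ in frame $f$ and $M_f^i$ as a binary region mask, and interprets each ratio as summed attention inside (or outside) the mask divided by total attention in that frame. The summary does not define `dist`, so the normalized L1 distance used here is a placeholder, and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero


def spatial_loss(attn, mask, lam_fg=1.0, lam_bg=1.0):
    """L_sp for one subject token; attn and mask have shape (F, H, W).

    Assumed reading: the fractions compare attention mass inside or
    outside the mask with the total attention mass of each frame.
    """
    total = attn.sum(axis=(1, 2)) + EPS
    inside = (attn * mask).sum(axis=(1, 2)) / total
    outside = (attn * (1.0 - mask)).sum(axis=(1, 2)) / total
    l_fg = np.mean((1.0 - inside) ** 2)  # (1/F) * sum over frames
    l_bg = np.mean(outside ** 2)
    return float(lam_fg * l_fg + lam_bg * l_bg)


def attn_dist(a, b):
    """Placeholder for dist(.,.): L1 distance between normalized maps."""
    a = a / (a.sum() + EPS)
    b = b / (b.sum() + EPS)
    return float(np.abs(a - b).sum())


def syntax_loss(attn, pairs, unrelated):
    """L_syt over (noun i, verb j) pairs.

    attn: dict mapping token index -> array of shape (F, H, W)
    pairs: list of (i, j) tuples, one per syntactic pair in S*
    unrelated: dict mapping i -> list of token indices U_i
    """
    loss = 0.0
    for i, j in pairs:
        frames = attn[i].shape[0]
        l_pos = sum(attn_dist(attn[i][f], attn[j][f]) for f in range(frames)) / frames
        l_neg = sum(
            attn_dist(attn[i][f], attn[u][f])
            for u in unrelated[i]
            for f in range(frames)
        ) / frames
        # Small when the paired maps agree and unrelated maps differ.
        loss += l_pos / (l_pos + l_neg + EPS)
    return loss
```

Under this reading, `spatial_loss` approaches zero when all of a subject's attention falls inside its mask, and `syntax_loss` approaches zero when a verb's map coincides with its noun's map while staying far from the maps of unrelated tokens.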

S$^2$AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Motion Control for Enhanced Complex Action Video Generation

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Edit Temporal-Consistent Videos with Image Diffusion Model

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Spectral Motion Alignment for Video Motion Transfer using Diffusion Models

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Motion Guided Spatial Attention for Video Captioning

MoVideo: Motion-Aware Video Generation with Diffusion Models

Motion Guided Region Message Passing for Video Captioning