S$^2$AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

Yuanhang Li, Qi Mao, Lan Chen, Zhen Fang, Lei Tian, Xinyan Xiao, Libiao Jin, Hua Wu
2024-09-24
Abstract: Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose S$^2$AG-Vid, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S$^2$AG-Vid initially applies a spatial position-based cross-attention (CA) constraint in the early stages of the denoising process, so that multiple nouns distinctly attend to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a failure of text-to-video (T2V) generation models: when a scene contains multiple objects performing different actions, existing models have difficulty aligning each subject in the text description with its corresponding action. Specifically, there are two main problems:

1. **Mismatch in the number of subjects**: The number of subjects in the generated video does not match the text description. For example, given the prompt "A man is walking and a dog is running", the generated video may contain one man and two or more dogs, or only one man.
2. **Incorrect action binding**: Even if the number of subjects is correct, it is still challenging to assign each action to the right subject. For example, given the prompt "A man is skateboarding and a dog is sitting", the generated video may show both the man and the dog skateboarding, or the man standing while the dog is skateboarding.

To solve these problems, the authors propose S$^2$AG-Vid, which improves the alignment of multiple subjects with multiple actions through space- and syntax-aware attention guidance. S$^2$AG-Vid applies cross-attention constraints based on spatial positions in the early stage of the denoising process, then introduces a syntax-based contrastive constraint in the subsequent stage to strengthen the association between verbs and their nouns, thereby improving subject-action consistency.
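The two-stage schedule described above can be sketched as a small helper that picks which constraint is active at a given denoising step. This is only an illustration: the function name `guidance_stage` and the `spatial_fraction` switch point are hypothetical, and the summary does not state where the method actually switches from the spatial to the syntactic constraint.

```python
def guidance_stage(step, num_steps, spatial_fraction=0.5):
    """Return the constraint applied at a 0-indexed denoising step.

    spatial_fraction is a hypothetical knob: the share of early steps
    that use the spatial constraint before switching to the syntactic one.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must be in [0, num_steps)")
    switch_step = int(num_steps * spatial_fraction)
    return "spatial" if step < switch_step else "syntactic"
```

With 50 steps and the placeholder default, steps 0 to 24 would use the spatial constraint and steps 25 to 49 the syntactic one.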
### Formula summary

- **Space-aware constraints**:

\[ L_{fg}=\frac{1}{F}\sum_{i,j\in S^*}\sum_{f\in F}\left(1 - \frac{A_f^i\cdot M_f^i}{A_f^i}\right)^2 \]

\[ L_{bg}=\frac{1}{F}\sum_{i,j\in S^*}\sum_{f\in F}\left(\frac{A_f^i\cdot(1 - M_f^i)}{A_f^i}\right)^2 \]

\[ L_{sp}=\lambda_{fg}L_{fg}+\lambda_{bg}L_{bg} \]

- **Syntax-aware constraints**:

\[ L_{pos}(s_i^*)=\frac{1}{F}\sum_{f\in F}\text{dist}(A_f^i,A_f^j) \]

\[ L_{neg}(s_i^*,U_i)=\frac{1}{F}\sum_{u\in U_i}\sum_{f\in F}\text{dist}(A_f^i,A_f^u) \]

\[ L_{syt}=\sum_{s_i^*\in S^*}\frac{L_{pos}(s_i^*)}{L_{pos}(s_i^*)+L_{neg}(s_i^*,U_i)} \]

The space-aware constraints ensure that each subject's attention is concentrated in its specified region, and the syntax-aware constraints strengthen the association between verbs and their corresponding nouns. Through these constraints, S$^2$AG-Vid significantly improves the alignment quality of multiple subjects and multiple actions.
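As a rough illustration of how these losses might be computed, here is a minimal pure-Python sketch. It rests on several assumptions that the summary does not confirm: the ratio $\frac{A_f^i\cdot M_f^i}{A_f^i}$ is read as the share of attention mass falling inside the mask (both terms summed over spatial positions), `dist` is replaced by an L1 distance between normalized maps as a stand-in, and every function name and the flattened-list data layout are hypothetical.

```python
EPS = 1e-8  # guards against division by zero

def mass_fraction(attn_frame, mask_frame):
    """Share of a frame's attention mass that falls where the mask is 1."""
    total = sum(attn_frame) + EPS
    return sum(a * m for a, m in zip(attn_frame, mask_frame)) / total

def spatial_loss(attn, masks, lam_fg=1.0, lam_bg=1.0):
    """L_sp = lam_fg * L_fg + lam_bg * L_bg.

    attn[s][f] and masks[s][f] are flattened H*W lists for subject s, frame f.
    """
    l_fg = l_bg = 0.0
    for subj_attn, subj_masks in zip(attn, masks):
        num_frames = len(subj_attn)
        for a, m in zip(subj_attn, subj_masks):
            inside = mass_fraction(a, m)
            outside = mass_fraction(a, [1.0 - v for v in m])
            l_fg += (1.0 - inside) ** 2 / num_frames
            l_bg += outside ** 2 / num_frames
    return lam_fg * l_fg + lam_bg * l_bg

def l1_dist(a, b):
    """Stand-in for dist: L1 distance between sum-normalized maps."""
    sa, sb = sum(a) + EPS, sum(b) + EPS
    return sum(abs(x / sa - y / sb) for x, y in zip(a, b))

def mean_dist(maps_a, maps_b):
    """Average the per-frame distance over all frames."""
    return sum(l1_dist(a, b) for a, b in zip(maps_a, maps_b)) / len(maps_a)

def syntax_loss(pairs):
    """L_syt summed over (noun_maps, verb_maps, unrelated_maps) triples.

    noun_maps and verb_maps are per-frame lists of flattened maps;
    unrelated_maps is a list of such per-frame lists, one per token in U_i.
    """
    loss = 0.0
    for noun, verb, unrelated in pairs:
        l_pos = mean_dist(noun, verb)
        l_neg = sum(mean_dist(noun, u) for u in unrelated)
        loss += l_pos / (l_pos + l_neg + EPS)
    return loss
```

Under this reading, both losses are minimized: `spatial_loss` approaches zero when each subject's attention lies entirely inside its mask, and `syntax_loss` approaches zero when a verb's map coincides with its noun's map while staying far from unrelated tokens. In an actual pipeline these quantities would be computed on differentiable tensors so that their gradients can steer the latent; the list-based version here only shows the arithmetic.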