Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
Abstract: Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the intricate movement of the physical world. Code is available at: https://github.com/alibaba/Tora
### What problems does this paper attempt to solve?
This paper addresses the challenge of controllable motion in video generation. Specifically, existing video generation models face the following problems when generating long videos with precise motion control:
1. **Insufficient motion consistency**: Most existing methods cannot maintain a consistent motion trajectory over long videos, which leads to unnatural object motion or drift.
2. **Resolution and frame-count limitations**: Traditional U-Net-based models can usually generate only fixed-resolution, short videos (e.g., 16 frames) and struggle with longer, higher-resolution output.
3. **Lack of multi-modal input support**: Many existing models guide video generation through a single modality (such as text or image) and cannot combine text, image, and trajectory information at the same time.
To solve these problems, the authors propose Tora, a new framework based on the Diffusion Transformer (DiT) and designed for generating videos with strong motion control. The main innovations of Tora include:
- **Introducing the Trajectory Extractor (TE)**: It encodes arbitrary trajectories into spatio-temporal motion patches that share the same latent space as the video patches, which helps preserve motion information.
- **Designing the Motion-guidance Fuser (MGF)**: It integrates multi-level motion conditions into the DiT blocks so that the generated video accurately follows the specified trajectory.
- **Supporting multiple input conditions**: Tora can process text, image, and trajectory conditions simultaneously, enabling flexible and controllable video generation.
With these innovations, Tora maintains stable motion control when generating videos at 720p resolution with up to 204 frames, and can accurately simulate complex motions in the physical world. According to the paper's experiments, Tora significantly outperforms existing methods in motion-control precision and video quality.
### Formula summary
To better understand the working principle of Tora, the following are some of the key formulas involved in the paper:
1. **Noise prediction objective function**:
\[
l_\epsilon = ||\epsilon - \epsilon_\theta(z_t, t, c)||_2^2
\]
where $\epsilon_\theta(\cdot)$ represents the noise prediction function of the 3D U-Net, $c$ is the conditional input, and $z_t$ is the noisy hidden state.
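As a rough illustration of this objective, the sketch below computes the squared L2 distance between a sampled noise vector and a predicted one. It is a minimal sketch, not the paper's training code: the function name `noise_prediction_loss` is hypothetical, the tensors are flattened to plain Python lists, and `eps_pred` is a hard-coded stand-in for the model output $\epsilon_\theta(z_t, t, c)$.

```python
def noise_prediction_loss(eps, eps_pred):
    """Squared L2 norm ||eps - eps_pred||_2^2 for two equal-length vectors."""
    if len(eps) != len(eps_pred):
        raise ValueError("eps and eps_pred must have the same length")
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# Assumed toy values: `eps` is the noise that was added to the latent,
# `eps_pred` stands in for the network's prediction of that noise.
eps = [0.5, -1.0, 2.0]
eps_pred = [0.0, -1.0, 1.0]

loss = noise_prediction_loss(eps, eps_pred)
print(loss)  # 0.25 + 0.0 + 1.0 = 1.25
```

The loss is zero only when the prediction matches the sampled noise exactly, which is what training drives the model toward.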
2. **Trajectory conversion to motion patches**:
\[
u(x_i, y_i) = x_{i + 1}-x_i; \quad v(x_i, y_i) = y_{i + 1}-y_i
\]
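Here, $(x_i, y_i)$ represents the position of the trajectory in the $i$-th frame, and $u$ and $v$ are the horizontal and vertical displacements from that position to the next frame.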
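The displacement formula can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper name `trajectory_to_displacements` and the sample points are hypothetical, and the later steps that turn these displacements into motion patches are not shown.

```python
def trajectory_to_displacements(points):
    """Map per-frame positions (x_i, y_i) to per-frame displacements (u, v).

    For each consecutive pair of frames:
        u = x_{i+1} - x_i,  v = y_{i+1} - y_i
    A trajectory with N points yields N - 1 displacements.
    """
    return [
        (x_next - x, y_next - y)
        for (x, y), (x_next, y_next) in zip(points, points[1:])
    ]


# Assumed toy trajectory: one (x, y) position per frame.
trajectory = [(10, 20), (12, 21), (15, 21), (15, 18)]

displacements = trajectory_to_displacements(trajectory)
print(displacements)  # [(2, 1), (3, 0), (0, -3)]
```

Each displacement is attached to the starting position $(x_i, y_i)$ of its frame pair, so the result describes a sparse set of motion vectors along the trajectory.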
3. **Adaptive normalization layer**:
\[
h_i=\gamma_i\cdot h_{i - 1}+\beta_i + h_{i - 1}
\]
where $\gamma_i$ and $\beta_i$ are the scale and shift parameters converted from the motion condition $f_i$, and the trailing $h_{i - 1}$ term acts as a residual connection.
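The adaptive normalization update can be sketched element-wise as follows. This is an illustrative sketch only: the function name `adaptive_norm_fuse` is hypothetical, the features are flattened to plain lists, and `gamma` and `beta` are passed in directly rather than being computed from the motion condition $f_i$ as in the paper.

```python
def adaptive_norm_fuse(h_prev, gamma, beta):
    """Element-wise h_i = gamma_i * h_{i-1} + beta_i + h_{i-1}."""
    if not (len(h_prev) == len(gamma) == len(beta)):
        raise ValueError("h_prev, gamma and beta must have the same length")
    return [g * h + b + h for h, g, b in zip(h_prev, gamma, beta)]


# Assumed toy values; in practice gamma and beta would be derived
# from the motion condition rather than written by hand.
h_prev = [1.0, 2.0, -1.0]
gamma = [0.0, 0.5, 1.0]
beta = [0.0, 1.0, 0.5]

h = adaptive_norm_fuse(h_prev, gamma, beta)
print(h)  # [1.0, 4.0, -1.5]
```

Where `gamma` and `beta` are both zero (the first element here), the update returns the input unchanged, so the motion condition only alters features where the scale or shift is non-zero.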
Through these techniques, Tora achieves fine-grained motion control during video generation and addresses the limitations of existing methods described above.