Abstract: Despite remarkable achievements in video synthesis, granular control over complex dynamics, such as nuanced movement among multiple interacting objects, remains a significant hurdle for dynamic world modeling. The difficulty is compounded by the need to handle object appearance and disappearance and drastic scale changes, and to keep instances consistent across frames. These challenges hinder the development of video generation that faithfully mimics real-world complexity, limiting its utility for applications that require a high level of realism and controllability, including advanced scene simulation and the training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework that affords fine-grained, trajectory-conditioned motion control via diffusion models. It enables precise manipulation of object trajectories and interactions, overcoming the prevalent limitations of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that generated frames can improve the performance of object trackers.

What problem does this paper attempt to address?

The paper addresses the problem of fine-grained control over complex dynamics in video synthesis, especially the subtle movements of multiple interacting objects. Existing video generation models struggle with objects that appear and disappear, with drastic scale changes, and with keeping each instance consistent across frames. These issues limit the practicality of video generation in applications that require high realism and controllability, such as advanced scene simulation and the training of perception systems. To overcome them, the paper proposes the TrackDiffusion framework, a diffusion model that conditions motion on object trajectories, so that trajectories and interactions can be manipulated precisely without the scale and continuity disruptions seen in existing models.

The main contributions of the paper are:

1. **First Application**: Proposes a trajectory-conditioned video generation method that generates continuous video sequences directly from tracklets (trajectory segments), a capability that existing video generation models lack.
2. **Instance Enhancer**: Introduces a new component, the instance enhancer, which keeps objects consistent across frames under difficult conditions such as occlusion and rapid movement.
3. **Experimental Validation**: Reports that adding trajectory constraints significantly improves the quality of the generated videos, and that training on the generated data greatly improves the TrackAP (Track Average Precision) score of an object tracker, demonstrating the effectiveness of the motion control.

Together, these contributions aim to make generated video more realistic and more controllable.
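To make the idea of a tracklet condition more concrete, the following is a minimal sketch of how per-object trajectories might be laid out before being passed to a generator. It is not the paper's data format or API: the `Tracklet` class, the `to_frame_conditions` and `box_area` helpers, and the normalized `(x1, y1, x2, y2)` box convention are all assumptions made for illustration. The sketch shows only the three properties the summary highlights: each instance keeps a stable identity across frames, an instance may be absent from some frames, and its box may change scale sharply.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed convention: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Tracklet:
    """Hypothetical container for one object's trajectory."""
    instance_id: int
    category: str
    boxes: Dict[int, Box]  # frame index -> box; a missing frame means the object is absent


def to_frame_conditions(
    tracklets: List[Tracklet], num_frames: int
) -> List[List[Tuple[int, str, Box]]]:
    """Return, for each frame, the (slot, category, box) entries of visible instances."""
    # Give every instance a fixed slot so its identity is stable across frames.
    ordered = sorted(tracklets, key=lambda t: t.instance_id)
    slots = {t.instance_id: i for i, t in enumerate(ordered)}
    frames = []
    for f in range(num_frames):
        visible = [
            (slots[t.instance_id], t.category, t.boxes[f])
            for t in tracklets
            if f in t.boxes  # skip instances that have not appeared or have left
        ]
        frames.append(sorted(visible))
    return frames


def box_area(box: Box) -> float:
    """Area of a normalized box, used here to inspect scale change."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


# A car that approaches the camera (its box grows) and then leaves the scene,
# and a person who only appears in frames 1 and 2.
car = Tracklet(7, "car", {
    0: (0.10, 0.50, 0.20, 0.60),
    1: (0.15, 0.45, 0.35, 0.65),
    2: (0.20, 0.40, 0.60, 0.80),
})
person = Tracklet(3, "person", {
    1: (0.70, 0.40, 0.75, 0.60),
    2: (0.72, 0.40, 0.77, 0.60),
})
frames = to_frame_conditions([car, person], num_frames=4)
```

Here `frames[0]` contains only the car, `frames[1]` and `frames[2]` contain both objects, and `frames[3]` is empty. A real conditioning pipeline would additionally need to encode these entries into whatever representation the diffusion model consumes, which this sketch does not attempt.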

TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

MV-Diffusion: Motion-aware Video Diffusion Model

FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models

Motion-Conditioned Diffusion Model for Controllable Video Synthesis

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

Video Diffusion Models

Accelerating Video Diffusion Models via Distribution Matching

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

Video Diffusion Models are Training-free Motion Interpreter and Controller

MoVideo: Motion-Aware Video Generation with Diffusion Models

TVG: A Training-free Transition Video Generation Method with Diffusion Models

DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

DINTR: Tracking via Diffusion-based Interpolation