Abstract: Diffusion models have obtained substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to the issue called conditional image leakage, where the image-to-video diffusion models (I2V-DMs) tend to over-rely on the conditional image at large time steps. We further address this challenge from both inference and training aspects. First, we propose to start the generation process from an earlier time step to avoid the unreliable large-time steps of I2V-DMs, as well as an initial noise distribution with optimal analytic expressions (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution to bridge the training-inference gap. Second, we design a time-dependent noise distribution (TimeNoise) for the conditional image during training, applying higher noise levels at larger time steps to disrupt it and reduce the model's dependency on it. We validate these general strategies on various I2V-DMs on our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines by producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. The project page: https://cond-image-leak.github.io/.

What problem does this paper attempt to address?

This paper addresses a key problem in image-to-video diffusion models (I2V-DMs): **conditional image leakage (CIL)**. Existing I2V-DMs tend to generate videos with less motion than expected. This is mainly because the model over-relies on the conditional image during the diffusion process and ignores the motion information contained in the noisy input.

#### Specific manifestations of conditional image leakage

1. **Insufficient motion**: Regardless of the input motion score, the motion score of the generated video is consistently lower than expected.
2. **Time-step dependence**: At larger time steps of the diffusion process, the noisy input is increasingly corrupted, while the conditional image still retains a large amount of detail. The model therefore tends to rely on the conditional image rather than the noisy input to predict motion, which results in insufficient motion in the generated video.

#### Solutions

To address conditional image leakage, the authors propose strategies on two fronts, inference and training:

1. **Inference strategies**:
   - **Start the generation process early**: Start video generation from an earlier time step, which avoids the unreliable large time steps and reduces over-reliance on the conditional image.
   - **Optimal initial noise distribution (Analytic-Init)**: Choose the initial noise distribution by minimizing its KL divergence from the actual marginal distribution at the starting time step, which narrows the gap between training and inference.
2. **Training strategies**:
   - **Time-dependent noise distribution (TimeNoise)**: During training, apply a time-dependent noise distribution to the conditional image, with higher noise levels at larger time steps. This disrupts the conditional image and reduces the model's reliance on it.
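The inference and training strategies above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a forward process of the form x_M = alpha_M * x_0 + sigma_M * eps with standard Gaussian eps, a diagonal Gaussian family for the initial noise, and the KL direction KL(marginal || initial), for which the minimizer is the moment-matched Gaussian. The linear schedule in `cond_noise_level` and all function and parameter names (`analytic_init_params`, `sigma_max`, and so on) are hypothetical. The summary only states that noise levels are higher at larger time steps, so the linear form is a placeholder.

```python
import math
import random
from statistics import fmean, pvariance


def analytic_init_params(x0_samples, alpha_m, sigma_m):
    """Moment-matched diagonal Gaussian for the marginal at start step M.

    Assumption: x_M = alpha_m * x_0 + sigma_m * eps with eps ~ N(0, I).
    Then E[x_M] = alpha_m * E[x_0] and
    Var[x_M] = alpha_m^2 * Var[x_0] + sigma_m^2 (per dimension).
    """
    columns = list(zip(*x0_samples))  # one tuple of values per dimension
    mean_m = [alpha_m * fmean(col) for col in columns]
    var_m = [alpha_m ** 2 * pvariance(col) + sigma_m ** 2 for col in columns]
    return mean_m, var_m


def sample_init(mean_m, var_m, rng):
    """Draw one starting latent from the moment-matched Gaussian."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean_m, var_m)]


def cond_noise_level(t, num_steps, sigma_max=1.0):
    """Hypothetical TimeNoise-style schedule: noise grows linearly with t."""
    return sigma_max * t / num_steps


def noise_condition(cond_image, t, num_steps, rng, sigma_max=1.0):
    """Perturb the conditional image more strongly at larger time steps."""
    sigma_c = cond_noise_level(t, num_steps, sigma_max)
    return [c + sigma_c * rng.gauss(0.0, 1.0) for c in cond_image]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0.0, 2.0], [2.0, 4.0]]  # toy "clean latents", two dimensions
    mean_m, var_m = analytic_init_params(data, alpha_m=0.5, sigma_m=1.0)
    print("init mean:", mean_m, "init var:", var_m)
    print("start latent:", sample_init(mean_m, var_m, rng))
    print("noised condition at t=900:", noise_condition([0.2, 0.8], 900, 1000, rng))
```

In this sketch, sampling would start from `sample_init` at an intermediate step M rather than from pure standard Gaussian noise at the final step, and `noise_condition` would be applied to the conditional image only during training.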
#### Experimental verification

The authors ran experiments on multiple I2V-DMs, using their collected open-domain image benchmark and the UCF101 dataset, to verify the effectiveness of these strategies. The results show that the proposed methods produce videos with higher motion scores and lower errors while maintaining image alignment and temporal consistency, which improves overall performance.

### Summary

By identifying and addressing conditional image leakage, this paper proposes an effective way to improve the motion control accuracy and naturalness of image-to-video generation models. This improves the quality of the generated videos and offers new perspectives and methods for future research.

Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
