Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model

Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, Jun Zhu
2024-10-03
Abstract: Diffusion models have obtained substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to the issue called conditional image leakage, where the image-to-video diffusion models (I2V-DMs) tend to over-rely on the conditional image at large time steps. We further address this challenge from both inference and training aspects. First, we propose to start the generation process from an earlier time step to avoid the unreliable large-time steps of I2V-DMs, as well as an initial noise distribution with optimal analytic expressions (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution to bridge the training-inference gap. Second, we design a time-dependent noise distribution (TimeNoise) for the conditional image during training, applying higher noise levels at larger time steps to disrupt it and reduce the model's dependency on it. We validate these general strategies on various I2V-DMs on our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines by producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. The project page: https://cond-image-leak.github.io/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a key problem in image-to-video diffusion models (I2V-DMs): **conditional image leakage (CIL)**. Existing I2V-DMs tend to generate videos with less motion than expected, mainly because the model over-relies on the conditional image during the diffusion process and neglects the motion information carried by the noisy input.

#### Specific manifestations of conditional image leakage

1. **Insufficient motion**: Regardless of the input motion score, the motion score of the generated video tends to be lower than expected.
2. **Time-step dependence**: As the diffusion time step increases, the noisy input becomes increasingly corrupted, while the conditional image still retains a large amount of detail. The model therefore tends to rely on the conditional image rather than the noisy input to predict motion, which results in insufficient motion in the generated video.

#### Solutions

To mitigate conditional image leakage, the authors propose strategies for both inference and training:

1. **Inference strategies**:
   - **Start the generation process early**: Starting video generation from an earlier time step avoids the unreliable large time steps and reduces over-reliance on the conditional image.
   - **Optimal initial noise distribution (Analytic-Init)**: The initial noise distribution is chosen by minimizing its KL divergence to the actual marginal distribution at the starting time step, which narrows the gap between training and inference.
2. **Training strategies**:
   - **Time-dependent noise distribution (TimeNoise)**: During training, noise drawn from a time-dependent distribution is applied to the conditional image, with higher noise levels at larger time steps, which disrupts the conditional image and reduces the model's reliance on it.
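
The two named strategies can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. For Analytic-Init it assumes a forward process of the form x_M = alpha_M * x_0 + sigma_M * eps and an isotropic Gaussian initial distribution; under those assumptions, minimizing the KL divergence from the true marginal to the Gaussian reduces to matching the mean and the average per-dimension variance. For TimeNoise it assumes a logit-normal noise level whose location grows with the time step; that form, the constants, and all function names (`analytic_init_params`, `sample_analytic_init`, `sample_cond_noise_level`, `noisy_condition`) are hypothetical choices made for illustration.

```python
import math
import random


def analytic_init_params(clean_samples, alpha_m, sigma_m):
    """Moment-matched mean and isotropic variance at the start step M.

    Assumes x_M = alpha_m * x_0 + sigma_m * eps with eps ~ N(0, I), and
    an isotropic Gaussian family N(mu, var * I) for the initial noise.
    """
    n = len(clean_samples)
    d = len(clean_samples[0])
    mean0 = [sum(x[j] for x in clean_samples) / n for j in range(d)]
    # Average per-dimension variance of the clean data, i.e. Tr(Cov) / d.
    var0 = sum(
        (x[j] - mean0[j]) ** 2 for x in clean_samples for j in range(d)
    ) / (n * d)
    mu = [alpha_m * m for m in mean0]
    var = alpha_m ** 2 * var0 + sigma_m ** 2
    return mu, var


def sample_analytic_init(mu, var, rng):
    """Draw one starting latent from N(mu, var * I)."""
    std = math.sqrt(var)
    return [m + std * rng.gauss(0.0, 1.0) for m in mu]


def sample_cond_noise_level(t, rng, beta_max=1.0, spread=1.0):
    """Sample a noise level for the conditional image at time t in [0, 1].

    Hypothetical schedule: a logit-normal draw scaled by beta_max whose
    location rises linearly with t, so larger t gives more noise on average.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    loc = 4.0 * t - 2.0  # illustrative constants, not taken from the paper
    z = rng.gauss(loc, spread)
    return beta_max / (1.0 + math.exp(-z))


def noisy_condition(cond_image, t, rng):
    """Perturb a flattened conditional image with time-dependent noise."""
    level = sample_cond_noise_level(t, rng)
    return [p + level * rng.gauss(0.0, 1.0) for p in cond_image]
```

In a training loop, `noisy_condition` would replace the clean conditional image before it is passed to the denoiser, using the same time step as the noisy video latent. At inference, `sample_analytic_init` would supply the starting latent for the earlier start step M in place of a standard Gaussian draw.
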
#### Experimental verification

The authors evaluate these strategies on multiple I2V-DMs, using a collected open-domain image benchmark and the UCF101 dataset. The results show that the proposed methods produce videos with higher motion scores and lower errors than the baselines while maintaining image alignment and temporal consistency, which improves overall performance.

### Summary

By identifying and addressing conditional image leakage, this paper improves the motion control accuracy and naturalness of image-to-video generation models. The analysis of how the conditional image is used at large time steps also offers a starting point for future work on I2V-DMs.