AtomoVideo: High Fidelity Image-to-Video Generation

Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng
2024-03-05
Abstract: Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be readily combined with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is high-fidelity image-to-video (I2V) generation. Specifically, the authors propose a new framework named AtomoVideo, which aims to generate high-quality videos from a given reference image while maintaining a high degree of consistency with the input image and the coherence of video content. The following are the specific problems that this paper attempts to solve:

1. **High-fidelity image consistency**:
   - The generated video needs to retain the style, content, and fine-grained details of the input image as much as possible. This is more challenging than text-to-video (T2V) generation because the I2V task requires the generated video to be visually closer to the given reference image.
2. **Enhanced motion intensity and coherence**:
   - While ensuring the temporal consistency between video frames, enhance the motion effects in the video. Many existing methods sacrifice the naturalness and smoothness of motion in order to improve image consistency, resulting in the generated video appearing too static.
3. **Avoid relying on noise priors**:
   - Many existing methods use noise priors to enhance the detail fidelity of the generated video, but this method will reduce the motion intensity. AtomoVideo attempts to achieve high-fidelity and coherent motion effects without relying on noise priors.
4. **Long-sequence video prediction**:
   - Expand the model to generate longer video sequences through iterative generation. Due to the limitations of GPU memory, long-video generation is a significant challenge, and AtomoVideo solves this problem by predicting subsequent frames.
5. **Flexibility and controllability**:
   - AtomoVideo can flexibly combine existing personalized models and controllable modules to achieve more customized video generation. For example, it can be seamlessly integrated with plugins such as ControlNet and LoRAs to adapt to different application scenarios.
### Solution overview

AtomoVideo addresses the above problems through the following key techniques:

- **Multi-granularity image injection**: Inject image information at different levels, including low-level pixel information and high-level semantic information, to ensure high fidelity of the generated video to the input image.
- **Zero terminal signal-to-noise ratio and v-prediction strategy**: These training strategies improve the stability of the generation process without relying on noise priors.
- **Flexible design of spatio-temporal layers**: By adding 1D temporal convolution and attention modules and training only the parameters of these newly added modules, the model can efficiently handle video generation tasks.
- **Iterative frame prediction**: Predict subsequent frames given the previous frames, enabling the generation of long-sequence videos.

In summary, AtomoVideo aims to achieve high fidelity, strong motion effects, and good temporal consistency in image-to-video generation while maintaining the flexibility and controllability of the model.
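The zero terminal signal-to-noise ratio and v-prediction strategy listed above can be illustrated with a small numerical sketch. The summary does not state which noise schedule or rescaling AtomoVideo uses, so the following assumes a plain linear beta schedule and a commonly used shift-and-scale rescaling of sqrt(alpha_bar); all function names are hypothetical and chosen for illustration only. The sketch uses only the Python standard library and treats a sample as a flat list of floats.

```python
import math


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed starting point, not taken from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so that the last step has alpha_bar = 0
    (SNR = 0) while the first step keeps its original value."""
    sqrt_ab = [math.sqrt(a) for a in cumulative_alphas(betas)]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scale = first / (first - last)
    new_ab = [((s - last) * scale) ** 2 for s in sqrt_ab]
    # Recover per-step betas from the rescaled cumulative products.
    new_betas = [1.0 - new_ab[0]]
    for prev, cur in zip(new_ab, new_ab[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas


def snr(alpha_bar):
    """Signal-to-noise ratio of x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    Only defined for alpha_bar < 1."""
    return alpha_bar / (1.0 - alpha_bar)


def v_target(x0, eps, alpha_bar):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - s * x for x, e in zip(x0, eps)]


if __name__ == "__main__":
    betas = linear_betas()
    print("terminal SNR before:", snr(cumulative_alphas(betas)[-1]))
    rescaled = rescale_zero_terminal_snr(betas)
    print("terminal SNR after: ", snr(cumulative_alphas(rescaled)[-1]))
```

With a terminal SNR of zero, the final noisy sample is pure noise, so predicting the noise at that step would carry no information about the clean sample. The v target at that step reduces to `-x0`, which is one reason the two strategies are typically used together.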