Abstract: Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, owing to the design of adapter training, our approach can be readily combined with existing personalized models and controllable modules. In quantitative and qualitative evaluation, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website.

What problem does this paper attempt to address?

The main problem this paper addresses is high-fidelity image-to-video (I2V) generation. The authors propose a framework named AtomoVideo, which aims to generate high-quality video from a given reference image while staying consistent with that image and keeping the video content coherent. The specific problems it targets are:

1. **High-fidelity image consistency**: The generated video should retain the style, content, and fine-grained details of the input image as closely as possible. This is more challenging than text-to-video (T2V) generation because the I2V task requires the output to stay visually close to the given reference image.
2. **Enhanced motion intensity and coherence**: Motion in the video should be strengthened while temporal consistency between frames is preserved. Many existing methods sacrifice natural, smooth motion to improve image consistency, so the generated video appears too static.
3. **Avoiding reliance on noise priors**: Many existing methods use noise priors to improve the detail fidelity of the generated video, but this reduces motion intensity. AtomoVideo attempts to achieve high fidelity and coherent motion without relying on noise priors.
4. **Long-sequence video prediction**: The model is extended to generate longer video sequences through iterative generation. GPU memory limits make long-video generation a significant challenge, and AtomoVideo addresses it by predicting subsequent frames from earlier ones.
5. **Flexibility and controllability**: AtomoVideo can be combined with existing personalized models and controllable modules for more customized video generation. For example, it can be integrated with plugins such as ControlNet and LoRAs to suit different application scenarios.
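
The long-sequence idea in the list above (generate a short chunk, then feed its last frames back in as conditioning) can be sketched as a plain loop. This is only an illustration under stated assumptions, not AtomoVideo's implementation: `generate_long_video`, `predict_chunk`, `toy_predictor`, and the `chunk_size` and `context_size` parameters are all hypothetical names, and frames are represented as lists of floats rather than image tensors.

```python
from typing import Callable, List

# Stand-in for an image or latent; a real frame would be a tensor.
Frame = List[float]


def generate_long_video(
    first_frame: Frame,
    predict_chunk: Callable[[List[Frame], int], List[Frame]],
    total_frames: int,
    chunk_size: int = 8,
    context_size: int = 2,
) -> List[Frame]:
    """Extend a video chunk by chunk, conditioning each step on the latest frames."""
    video: List[Frame] = [first_frame]
    while len(video) < total_frames:
        # Only the most recent frames are passed to the model, so the cost of
        # one step depends on the chunk size, not on the total video length.
        context = video[-context_size:]
        n_new = min(chunk_size, total_frames - len(video))
        new_frames = predict_chunk(context, n_new)
        if len(new_frames) != n_new:
            raise ValueError("predict_chunk returned the wrong number of frames")
        video.extend(new_frames)
    return video


def toy_predictor(context: List[Frame], n_new: int) -> List[Frame]:
    """Hypothetical stand-in for the model: brightens the last frame step by step."""
    last = context[-1]
    return [[p + 0.1 * (i + 1) for p in last] for i in range(n_new)]


if __name__ == "__main__":
    video = generate_long_video([0.0, 0.0], toy_predictor, total_frames=20)
    print(len(video))
```

In a real system `predict_chunk` would wrap one sampling pass of the video model. The loop itself only decides which frames are passed as context and when to stop.
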
### Solution overview

AtomoVideo addresses the above problems through the following key techniques:

- **Multi-granularity image injection**: Image information is injected at different levels, including low-level pixel information and high-level semantic information, to keep the generated video faithful to the input image.
- **Zero terminal signal-to-noise ratio (SNR) and v-prediction strategy**: These training strategies improve the stability of the generation process without relying on noise priors.
- **Flexible design of spatio-temporal layers**: 1D temporal convolution and attention modules are added, and only the parameters of these new modules are trained, so the model can handle video generation efficiently.
- **Iterative frame prediction**: Subsequent frames are predicted given the previous frames, which enables long-sequence video generation.

In summary, AtomoVideo aims to achieve high fidelity, strong motion, and good temporal consistency in image-to-video generation while keeping the model flexible and controllable.
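
To make the zero terminal SNR and v-prediction strategy named above more concrete, here is a minimal sketch of the formulation commonly used in the diffusion literature. The summary does not spell out the details, so it is an assumption that AtomoVideo uses exactly this form. The function names `rescale_zero_terminal_snr` and `v_target`, the linear beta schedule, and the scalar inputs are illustrative choices. With cumulative signal level $\bar\alpha_t$, the v-prediction target is $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$.

```python
import math
from typing import List


def rescale_zero_terminal_snr(betas: List[float]) -> List[float]:
    """Rescale a beta schedule so the final step carries no signal (SNR = 0)."""
    # sqrt of the cumulative product of (1 - beta): the signal scale at each step.
    sqrt_alphas_bar = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        sqrt_alphas_bar.append(math.sqrt(prod))

    first, last = sqrt_alphas_bar[0], sqrt_alphas_bar[-1]
    # Shift so the last value is zero, then scale so the first is unchanged.
    shifted = [(a - last) * first / (first - last) for a in sqrt_alphas_bar]

    # Convert the cumulative values back to per-step betas.
    alphas_bar = [a * a for a in shifted]
    alphas = [alphas_bar[0]] + [
        alphas_bar[t] / alphas_bar[t - 1] for t in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]


def v_target(x0: float, eps: float, alpha_bar: float) -> float:
    """v-prediction target for one scalar; apply elementwise to real tensors."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0


if __name__ == "__main__":
    # Assumed linear schedule, used here only as an example input.
    steps = 1000
    betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]
    new_betas = rescale_zero_terminal_snr(betas)
    print(new_betas[-1])  # the final beta after rescaling
    print(v_target(x0=2.0, eps=0.5, alpha_bar=0.0))  # target at the terminal step
```

At the terminal step the input is pure noise, so a noise-prediction target would carry no information about the image. The v target there equals $-x_0$, which is why the two techniques are usually paired.
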

AtomoVideo: High Fidelity Image-to-Video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Imagen Video: High Definition Video Generation with Diffusion Models

Image-to-Video Generation via 3D Facial Dynamics

LoopAnimate: Loopable Salient Object Animation

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Make-a-video: Text-to-video generation without text-video data

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Towards Smooth Video Composition

Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

AnimateAnything: Consistent and Controllable Animation for Video Generation