Abstract: Text-guided diffusion models have advanced rapidly in generating and editing high-quality images. Attempts to extend this success to video editing by combining image generation with video editing have so far achieved inferior performance. We attribute this to two challenges: 1) unlike static image generation, video editing must maintain temporal fidelity, that is, motion consistency across frames, which is difficult for dynamic content; 2) the randomness of the frame generation process makes it hard to consistently retain spatial fidelity to the original detailed features. In this paper, we propose HiFiVEditor, a high-fidelity, diffusion model-based network for zero-shot text-guided video editing that aims to edit videos effectively while preserving the detailed and dynamic information of the original video. Specifically, we propose a Spatial-Temporal Fidelity Block (STFB) that restores spatial features by enlarging the spatial perceptual field to avoid losing important information, and captures more dynamic information across frames by using all frames to preserve temporal consistency, thereby achieving better temporal fidelity. In addition, we introduce Null-Text Embedding to create a soft text embedding that optimizes the noise learning process, so that the latent noise is aligned with the prompt. Furthermore, to tune the video style and render it more realistic, we employ a Prior-Guided Perceptual Loss that constrains the predictions from deviating from the original video style. Extensive experiments demonstrate superior video editing capability compared to existing works.

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting

Fine-gained Zero-shot Video Sampling

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

INVE: Interactive Neural Video Editing

VideoDirector: Precise Video Editing via Text-to-Video Models

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

High-Fidelity Diffusion Editor for Zero-Shot Text-Guided Video Editing

Motion Inversion for Video Customization

VidToMe: Video Token Merging for Zero-Shot Video Editing

Edit Temporal-Consistent Videos with Image Diffusion Model

EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

Video Snapshot: Single Image Motion Expansion Via Invertible Motion Embedding

PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing