Streaming Video Diffusion: Online Video Editing with Diffusion Models

Feng Chen,Zhen Yang,Bohan Zhuang,Qi Wu
2024-05-30
Abstract:We present a novel task called online video editing, which is designed to edit \textbf{streaming} frames while maintaining temporal consistency. Unlike existing offline video editing assuming all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To solve these issues, we propose Streaming Video Diffusion (SVDiff), which incorporates the compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with the segment-level scheme on large-scale long videos. This simple yet effective setup allows us to obtain a single model that is capable of executing a broad range of videos and editing each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos with remarkable results, achieving a real-time inference speed of 15.2 FPS at a resolution of 512x512.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve This paper aims to address the problem of online video editing. Unlike traditional offline video editing, online video editing requires real-time processing of each frame in the video stream while maintaining temporal consistency. Specifically, the paper focuses on the following key challenges: 1. **Fast Continuous Inference**: Online video editing requires fast processing of each frame, which demands efficient inference capabilities from the model. 2. **Long-term Temporal Modeling**: Online video streams usually contain long video sequences, so the model needs to handle long-term temporal dynamics. 3. **Zero-shot Video Editing Capability**: To make online video editing practical and effective, the model must be capable of editing any video without requiring specific training on the video beforehand. ### Solution To address the above challenges, the authors propose a model named "Streaming Video Diffusion" (SVDiff). The main features of SVDiff include: 1. **Recursive Spatial-aware Temporal Memory**: SVDiff introduces a learnable spatial-aware temporal memory embedding, which is recursively updated through a memory attention mechanism after each Transformer block. This memory embedding acts as a dynamic temporal cache, continuously updating to capture the details of each video frame and the motion trajectories between frames. 2. **Segmented Training Scheme**: To efficiently train long videos, SVDiff decomposes long videos into multiple short segments for training. Within each segment, the model not only updates and processes the memory but also propagates temporal memory between consecutive segments to maintain temporal historical information. 3. **Efficient Inference Process**: During the inference phase, SVDiff uses Classifier-Free Guidance (CFG) to balance fidelity and controllability. Additionally, through techniques like LCM LoRA and TensorRT, SVDiff achieves a real-time inference speed of 15.2 FPS while maintaining high-quality video generation. ### Experimental Results The paper validates the effectiveness of SVDiff through a series of experiments. Specifically: - **Quantitative Results**: SVDiff significantly outperforms baseline models in terms of temporal consistency and editing quality. For example, compared to the temporal shift method, SVDiff improves temporal consistency by 1.53% with only an additional 2,887 GFLOPs of computation. - **Qualitative Results**: Through visual comparisons, SVDiff-generated videos excel in adhering to editing prompts and maintaining temporal consistency. Other methods show noticeable deficiencies in these aspects, such as window attention and sliding window methods performing poorly on long-term dynamics, and the temporal shift method showing poor consistency in detailed textures. - **Editing Performance on Videos of Different Lengths**: SVDiff performs excellently when handling videos of different lengths, especially outperforming other methods when editing long videos. ### Conclusion By introducing recursive spatial-aware temporal memory and a segmented training scheme, SVDiff successfully addresses the key challenges in online video editing, achieving efficient and high-quality video editing. This approach has significant potential for practical applications, particularly in scenarios like live streaming and online chatting.