Neural Video Fields Editing

Shuzhou Yang, Chong Mou, Jiwen Yu, Yuhan Wang, Xiandong Meng, Jian Zhang
DOI: https://doi.org/10.48550/arXiv.2312.08882
2024-03-09
Abstract: Diffusion models have revolutionized text-driven video editing. However, applying these methods to real-world editing encounters two significant challenges: (1) the rapid increase in GPU memory demand as the number of frames grows, and (2) the inter-frame inconsistency in edited videos. To this end, we propose NVEdit, a novel text-driven video editing framework designed to mitigate memory overhead and improve consistent editing for real-world long videos. Specifically, we construct a neural video field, powered by tri-plane and sparse grid, to enable encoding long videos with hundreds of frames in a memory-efficient manner. Next, we update the video field through off-the-shelf Text-to-Image (T2I) models to impart text-driven editing effects. A progressive optimization strategy is developed to preserve original temporal priors. Importantly, both the neural video field and T2I model are adaptable and replaceable, thus inspiring future research. Experiments demonstrate the ability of our approach to edit hundreds of frames with impressive inter-frame consistency. Our project is available at: https://nvedit.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two major challenges that existing text-driven video editing methods face when dealing with long videos:

1. **Rapid Increase in GPU Memory Requirements**: GPU memory demand rises rapidly with the number of video frames, which limits the length of video that can be edited.
2. **Inter-frame Inconsistency**: Frames in the edited video disagree with one another, for example through object deformation and texture changes, because existing text-to-image (T2I) models lack temporal priors.

To solve these problems, the paper proposes a new framework named NVEdit. The framework encodes long videos efficiently by constructing a Neural Video Field (NVF) and performs text-driven editing through off-the-shelf T2I models, thereby reducing memory overhead and improving editing consistency. NVEdit works in two stages:

- **Video Fitting Stage**: Construct a neural video field that uses tri-plane encoding and multi-layer perceptron decoding to model the temporal and content priors of a given video. Because the encoding is so efficient, even a video with hundreds of frames can be represented compactly as a signal field.
- **Field Editing Stage**: Update the trained NVF through the T2I model to give it text-driven editing effects. In each iteration, the NVF renders a frame, and the T2I model edits that frame according to the provided text. The edited frames serve as pseudo Ground Truths (GTs) for optimizing the parameters of the NVF. Since the NVF was initially trained on the original video, it retains strong temporal priors after optimization while showing the desired editing effects.

In addition, NVEdit introduces an auxiliary mask to enhance local editing, so that edits land accurately in the intended region and the unedited area is left unaffected. This strategy not only improves editing quality but also shows that the T2I model in NVEdit can be adapted and improved.
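The two stages and the mask blending can be illustrated with a toy sketch in pure Python. It is not the authors' implementation, and every name in it (`TriPlaneField`, `toy_edit`, `blend`) is hypothetical. It makes several simplifying assumptions: each plane stores one scalar per grid node, a plain sum stands in for the MLP decoder, the sparse grid is omitted, `toy_edit` replaces a real T2I model with a flat bright tone, and pixels outside the mask are taken from the original frame.

```python
class TriPlaneField:
    """Toy grayscale video field: value(x, y, t) = XY[x][y] + XT[x][t] + YT[y][t]."""

    def __init__(self, width, height, frames):
        self.size = (width, height, frames)
        # Three 2D planes of scalars: storage grows as w*h + w*T + h*T, not w*h*T.
        self.xy = [[0.0] * height for _ in range(width)]
        self.xt = [[0.0] * frames for _ in range(width)]
        self.yt = [[0.0] * frames for _ in range(height)]

    def render(self, x, y, t):
        # Assumption: a plain sum stands in for the MLP decoder.
        return self.xy[x][y] + self.xt[x][t] + self.yt[y][t]

    def render_frame(self, t):
        w, h, _ = self.size
        return [[self.render(x, y, t) for y in range(h)] for x in range(w)]

    def fit_frame(self, t, target, lr=0.2):
        """One sweep of per-pixel gradient steps pulling frame t toward `target`."""
        w, h, _ = self.size
        for x in range(w):
            for y in range(h):
                err = self.render(x, y, t) - target[x][y]
                self.xy[x][y] -= lr * err
                self.xt[x][t] -= lr * err
                self.yt[y][t] -= lr * err


def toy_edit(frame):
    # Hypothetical stand-in for a T2I editor: paints the frame a flat bright tone.
    return [[0.8 for _ in row] for row in frame]


def blend(edited, original, mask):
    # Keep edited pixels where the mask is 1 and original pixels elsewhere.
    return [[e if m else o for e, o, m in zip(e_row, o_row, m_row)]
            for e_row, o_row, m_row in zip(edited, original, mask)]


W, H, T = 4, 4, 6
# Synthetic video indexed as video[t][x][y]; all values stay below 0.8.
video = [[[0.1 + 0.05 * x + 0.03 * y + 0.02 * t for y in range(H)]
          for x in range(W)] for t in range(T)]
# Local-edit mask indexed as mask[x][y]: edit only the columns with x < 2.
mask = [[1 if x < 2 else 0 for _ in range(H)] for x in range(W)]

field = TriPlaneField(W, H, T)

# Stage 1 (video fitting): train the field to reproduce the original frames.
for _ in range(300):
    for t in range(T):
        field.fit_frame(t, video[t])
fit_error = max(abs(field.render(x, y, t) - video[t][x][y])
                for t in range(T) for x in range(W) for y in range(H))

# Stage 2 (field editing): render, edit, blend, then optimize on the pseudo GT.
for _ in range(300):
    for t in range(T):
        rendered = field.render_frame(t)
        pseudo_gt = blend(toy_edit(rendered), video[t], mask)
        field.fit_frame(t, pseudo_gt)

edited_pixel = field.render(0, 0, 0)     # inside the mask
untouched_pixel = field.render(3, 3, 5)  # outside the mask
```

In this toy setup the field stores 16 + 24 + 24 = 64 scalars for a 4 x 4 x 6 video of 96 pixels. The gap widens as the frame count grows, which is the intuition behind the memory claim. The editing loop starts from the already fitted field, so parameters shared across frames carry the original video's structure into the edited result.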