Splatter a Video: Video Gaussian Representation for Versatile Processing

Yang-Tian Sun, Yi-Hua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, Xiaojuan Qi
2024-06-26
Abstract: Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation, the video Gaussian representation, that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: https://sunyangtian.github.io/spatter_a_video_web/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in video processing, in particular the challenges of video representation:

1. **Modeling of complex motions**: Existing methods struggle with complex motions, either because they lack 3D structure or because they rely on implicit 3D representations that are ill-suited to manipulation tasks.
2. **Object occlusion**: Current methods handle object occlusions poorly (especially complex self-occlusions), which leads to error propagation and to problems in editing tasks.
3. **Tasks requiring 3D information**: Many video processing tasks (such as consistent depth prediction and stereoscopic video generation) require 3D information, which existing methods provide only to a limited extent or not at all.

To solve these problems, the authors propose a new explicit 3D representation, the **Video Gaussian Representation (VGR)**. It embeds the video into 3D Gaussians: explicit Gaussians in a 3D canonical space act as proxies for the video's appearance, and each Gaussian is associated with 3D motion attributes that control its position at different time steps, which is how video motion is modeled.

The main contributions of VGR are:

- Providing a more intrinsic and explicit representation than layered atlases or volumetric pixel matrices.
- Making this ill-posed problem tractable by regularizing learning with 2D priors (such as optical flow and depth) distilled from foundation models.
- Demonstrating a wide range of applications across video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.

With these properties, VGR can better handle complex motions, occlusions, and noise in videos while maintaining temporal consistency, which makes it a stronger foundation for a variety of video processing tasks.
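
The idea that each Gaussian lives in a canonical 3D space and carries motion attributes that determine its position at each time step can be illustrated with a small sketch. This is not the authors' implementation: the text above does not specify the motion parameterization or the camera model, so the polynomial motion coefficients, the orthographic projection, and all names (`VideoGaussian`, `position_at`, `project`, `track`, `flow_loss`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


@dataclass
class VideoGaussian:
    """One explicit Gaussian in the canonical space (hypothetical layout)."""
    center: Vec3               # canonical 3D position
    color: Vec3                # RGB appearance
    motion_coeffs: List[Vec3]  # ASSUMED polynomial motion: coefficients of t, t^2, ...

    def position_at(self, t: float) -> Vec3:
        """Canonical center plus a time-dependent 3D offset."""
        x, y, z = self.center
        for k, (cx, cy, cz) in enumerate(self.motion_coeffs, start=1):
            p = t ** k
            x += cx * p
            y += cy * p
            z += cz * p
        return (x, y, z)


def project(p: Vec3) -> Vec2:
    """Orthographic projection onto the image plane (an assumption of this sketch)."""
    return (p[0], p[1])


def track(g: VideoGaussian, times: Sequence[float]) -> List[Vec2]:
    """2D trajectory of one Gaussian: tracking falls out of the explicit 3D motion."""
    return [project(g.position_at(t)) for t in times]


def flow_loss(g: VideoGaussian, t0: float, t1: float, flow_prior: Vec2) -> float:
    """Squared error between the projected 2D displacement and an optical-flow prior."""
    u0, v0 = project(g.position_at(t0))
    u1, v1 = project(g.position_at(t1))
    du, dv = u1 - u0, v1 - v0
    return (du - flow_prior[0]) ** 2 + (dv - flow_prior[1]) ** 2


# A Gaussian that moves linearly in x and y while keeping its depth.
g = VideoGaussian(center=(0.0, 0.0, 2.0), color=(1.0, 0.0, 0.0),
                  motion_coeffs=[(1.0, 0.5, 0.0)])
trajectory = track(g, [0.0, 1.0, 2.0])
loss = flow_loss(g, 0.0, 1.0, flow_prior=(1.0, 0.5))
```

In this toy setup, a 2D track is simply the projection of the Gaussian's 3D path, and a term like `flow_loss` shows how a 2D prior could in principle constrain the 3D motion parameters during fitting. A real system would also optimize appearance, scale, rotation, and opacity, and would render the Gaussians rather than project only their centers.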