Abstract: Gaussian Splatting (GS) has significantly improved scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP and even use sparse point clouds generated by COLMAP for initialization; these inputs lack accuracy and are time-consuming to obtain. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements or extreme camera conditions, e.g., small translations combined with large rotations. Some studies jointly optimize the camera parameters and the scene, supervised by additional information such as depth and optical flow obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, which frequently occurs for long monocular videos (e.g., more than hundreds of frames). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use in the subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS.

MegaScenes: Scene-Level View Synthesis at Scale

Learning 3D Scene Synthesis from Annotated RGB-D Images

Benchmarking Large-Scale Multi-View 3D Reconstruction Using Realistic Synthetic Images

MINERVAS: Massive INterior EnviRonments VirtuAl Synthesis

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Self-supervised novel 2D view synthesis of large-scale scenes with efficient multi-scale voxel carving

MegaDepth: Learning Single-View Depth Prediction from Internet Photos

Large-Scale Indoor Visual-Geometric Multimodal Dataset and Benchmark for Novel View Synthesis

CompNVS: Novel View Synthesis with Scene Completion

Dynamic scene novel view synthesis via deferred spatio-temporal consistency

SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars

A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction

OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos

XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold

WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

Holistic Understanding of 3D Scenes as Universal Scene Description

NViST: In the Wild New View Synthesis from a Single Image with Transformers