pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
2024-04-05
Abstract: We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Main Problems Addressed by the Paper

This paper addresses generalizable novel view synthesis from sparse image observations, here pairs of images. Specifically:

1. **Efficient Rendering and Training**: Existing differentiable rendering methods are expensive in both memory and time. The paper proposes `pixelSplat`, a feed-forward model that renders in real time with low memory use, which makes training scalable and 3D reconstruction fast at inference time.
2. **Overcoming Local Minima**: Sparse, locally supported representations such as 3D Gaussian primitives easily get stuck in local minima. Instead of regressing Gaussian positions directly, the model predicts a dense probability distribution over 3D and samples Gaussian means from it. A reparameterization trick makes the sampling operation differentiable, so gradients can back-propagate through the Gaussian splatting representation.
3. **Solving Scale Ambiguity**: Camera poses in real-world datasets are reconstructed only up to an arbitrary scale factor. A multi-view epipolar transformer is designed to reliably infer the scale factor for each scene.
4. **Generating Editable 3D Representations**: Unlike methods that accelerate rendering without reconstructing an interpretable or editable 3D scene representation, this model reconstructs an interpretable and editable 3D radiance field from image pairs.
5. **Performance Improvement**: On wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, the method outperforms state-of-the-art light field transformers in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), and accelerates rendering by 2.5 orders of magnitude.
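PSNR, the first of the metrics listed above, can be computed as in the following minimal sketch. It assumes pixel values in [0, 1] stored as flat lists. The function name `psnr` and the `max_value` parameter are illustrative and are not taken from the paper's code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak Signal-to-Noise Ratio in decibels between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a rendering closer to the reference view. SSIM and LPIPS need structural and learned-feature comparisons, respectively, and are not covered by this sketch.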
### Technical Innovations

- **Pixel-Aligned 3D Gaussian Primitive Prediction**: For each pixel, the model predicts a probability distribution over the position of the 3D Gaussian primitive rather than regressing the position directly, which mitigates the local-minima problem.
- **Multi-View Epipolar Transformer**: The encoder finds correspondences between the two views along epipolar lines and annotates them with positional encodings of depth, which resolves the scale ambiguity.
- **Differentiable Parameterization and Sampling**: The opacity of each Gaussian primitive is set equal to the probability of its sampled depth bucket. This lets gradients reach the predicted distribution through the sampling step, so the model can be trained effectively.
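The sampling scheme in the last bullet can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the names `softmax` and `sample_gaussian` are hypothetical, the buckets are assumed to be evenly spaced in depth between hypothetical `near` and `far` planes, and the sample is placed at the bucket center. No gradients are computed here. In an autograd framework, the returned opacity would be the term that connects the rendering loss to the bucket probabilities.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gaussian(origin, direction, logits, near, far, rng):
    """Sample one Gaussian mean along a pixel's ray (assumed linear depth buckets).

    Returns (mean, opacity, bucket_index).
    """
    probs = softmax(logits)
    num_buckets = len(probs)
    # Draw a depth bucket according to the predicted per-pixel distribution.
    z = rng.choices(range(num_buckets), weights=probs, k=1)[0]
    # Assumption: buckets split [near, far] evenly; use the bucket center as depth.
    width = (far - near) / num_buckets
    depth = near + (z + 0.5) * width
    # Unproject: mean = ray origin + depth * ray direction.
    mean = tuple(o + depth * d for o, d in zip(origin, direction))
    # Opacity equals the probability of the sampled bucket.
    opacity = probs[z]
    return mean, opacity, z


rng = random.Random(0)
logits = [0.0, 2.0, 0.5, -1.0]  # hypothetical per-pixel network output
mean, opacity, z = sample_gaussian((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), logits, 1.0, 5.0, rng)
```

Because the opacity equals the probability of the sampled bucket, a Gaussian placed at a confidently predicted depth is nearly opaque, while one placed at an unlikely depth contributes little to the rendered image.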