Abstract: Low-level 3D representations, such as point clouds, meshes, NeRFs, and 3D Gaussians, are commonly used to represent 3D objects or scenes. However, humans usually perceive 3D objects or scenes at a higher level, as a composition of parts or structures rather than points or voxels. Representing 3D content as semantic parts can benefit further understanding and applications. We aim to solve part-aware 3D reconstruction, which parses objects or scenes into semantic parts. In this paper, we introduce a hybrid representation of superquadrics and 2D Gaussians that extracts 3D structural cues from multi-view image inputs. Accurate structured geometry reconstruction and high-quality rendering are achieved at the same time. We incorporate parametric superquadrics in mesh form into 2D Gaussians by attaching Gaussian centers to the faces of the meshes. During training, the superquadric parameters are iteratively optimized and the Gaussians are deformed accordingly, resulting in an efficient hybrid representation. On the one hand, this hybrid representation inherits the ability of superquadrics to represent different shape primitives, supporting flexible part decomposition of scenes. On the other hand, 2D Gaussians are incorporated to model complex texture and geometry details, ensuring high-quality rendering and geometry reconstruction. The reconstruction is fully unsupervised. We conduct extensive experiments on data from the DTU and ShapeNet datasets, in which the method decomposes scenes into reasonable parts, outperforming existing state-of-the-art approaches.
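The coupling described in the abstract (Gaussian centers attached to the faces of a superquadric mesh, so that changing the superquadric parameters moves the Gaussians) can be illustrated with a minimal sketch. It assumes the standard explicit superquadric parameterization with signed powers, a latitude/longitude triangulation, and one Gaussian per face at a fixed barycentric coordinate. The function names (`superquadric_mesh`, `gaussian_centers`) and the grid resolution are hypothetical and not taken from the paper, and pose (rotation and translation) is omitted.

```python
import math

def signed_pow(x, e):
    """sign(x) * |x|**e, which keeps the surface symmetric across octants."""
    return math.copysign(abs(x) ** e, x)

def superquadric_point(eta, omega, scale, eps):
    """Point on a superquadric in its canonical frame.

    eta is the latitude in [-pi/2, pi/2], omega the longitude in [-pi, pi).
    scale = (a1, a2, a3) are axis lengths, eps = (e1, e2) shape exponents.
    """
    a1, a2, a3 = scale
    e1, e2 = eps
    ce, se = math.cos(eta), math.sin(eta)
    co, so = math.cos(omega), math.sin(omega)
    return (
        a1 * signed_pow(ce, e1) * signed_pow(co, e2),
        a2 * signed_pow(ce, e1) * signed_pow(so, e2),
        a3 * signed_pow(se, e1),
    )

def superquadric_mesh(scale, eps, n_eta=8, n_omega=16):
    """Sample vertices on a lat/long grid and connect them with triangles.

    Triangles touching the poles are degenerate; a real mesh would use an
    icosphere or similar, this grid is only for illustration.
    """
    verts = []
    for i in range(n_eta + 1):
        eta = -math.pi / 2 + math.pi * i / n_eta
        for j in range(n_omega):
            omega = -math.pi + 2 * math.pi * j / n_omega
            verts.append(superquadric_point(eta, omega, scale, eps))
    faces = []
    for i in range(n_eta):
        for j in range(n_omega):
            a = i * n_omega + j
            b = i * n_omega + (j + 1) % n_omega
            c = (i + 1) * n_omega + j
            d = (i + 1) * n_omega + (j + 1) % n_omega
            faces.append((a, b, c))
            faces.append((b, d, c))
    return verts, faces

def gaussian_centers(verts, faces, bary=(1 / 3, 1 / 3, 1 / 3)):
    """One Gaussian center per face, at fixed barycentric coordinates.

    Because centers are a function of the vertices, re-evaluating the mesh
    with updated superquadric parameters moves every Gaussian with it.
    """
    centers = []
    for i, j, k in faces:
        centers.append(tuple(
            bary[0] * verts[i][t] + bary[1] * verts[j][t] + bary[2] * verts[k][t]
            for t in range(3)
        ))
    return centers

# Usage: stretching the primitive along x drags the attached centers along.
verts, faces = superquadric_mesh(scale=(1.0, 1.0, 1.0), eps=(1.0, 1.0))
before = gaussian_centers(verts, faces)
verts2, _ = superquadric_mesh(scale=(2.0, 1.0, 1.0), eps=(1.0, 1.0))
after = gaussian_centers(verts2, faces)
```

With `eps=(1.0, 1.0)` and unit scale the primitive is a sphere; exponents near zero give box-like shapes, which is what lets a single parametric family cover several kinds of parts.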

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the problem of **part-aware 3D reconstruction**. Specifically, the authors propose a hybrid representation that integrates 2D Gaussians and superquadrics to parse and reconstruct the semantic parts of 3D scenes. Traditional methods typically reconstruct 3D objects or scenes with low-level representations such as point clouds, voxels, or meshes, which do not match how humans understand 3D scenes. Humans usually perceive 3D objects or scenes as compositions of parts or structures rather than as points or voxels. The goal of the paper is therefore a method that decomposes 3D scenes into semantic parts, which better supports tasks such as scene manipulation, editing, and scene graph generation.

### Main Contributions

1. **Novel Hybrid Representation**: Introduces a hybrid representation that combines superquadrics and 2D Gaussians. Superquadrics model different shape primitives, while 2D Gaussians capture complex textures and geometric details, ensuring high-quality rendering and geometric reconstruction.
2. **End-to-End Unsupervised Pipeline**: Proposes a fully unsupervised end-to-end pipeline for part-aware reconstruction at both the block level and the point level, introducing new regularization terms to optimize superquadrics and 2D Gaussians simultaneously.
3. **Extensive Experimental Validation**: Conducts extensive experiments on the DTU and ShapeNet datasets, demonstrating the advantages of the proposed method in part-aware reconstruction, particularly in part segmentation and geometric detail modeling, where it surpasses existing state-of-the-art methods.

### Method Overview

1. **Hybrid Representation**: Combines superquadrics and 2D Gaussians into a compact hybrid representation. Each superquadric block is initialized with random parameters and gradually optimized during training. The 2D Gaussians are attached to the surface of the superquadrics and share their pose parameters, which improves efficiency.
2. **Optimization Process**: Optimizes the hybrid representation by minimizing the rendering loss over multi-view images. To keep the optimization stable and accurate, several regularization terms are introduced: coverage, overlap, simplicity, and opacity entropy.
3. **Stage-wise Optimization**:
   - **Block-level Optimization**: Optimizes the position and shape of the blocks through the image rendering loss and the regularization terms, ensuring that the blocks cover meaningful areas without overlapping.
   - **Point-level Optimization**: Building on the block-level result, releases the constraints on the 2D Gaussians so that they can move freely to fill complex areas, improving the accuracy of geometric detail modeling.

### Experimental Results

Experimental results on the DTU and ShapeNet datasets show that the proposed method not only decomposes 3D scenes into reasonable parts but also captures fine geometric details, outperforming existing state-of-the-art methods. The method also performs well on real data, indicating its potential for practical applications.
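Returning to the optimization process in the method overview: the summary names the regularization terms but does not give their formulas. The sketch below shows one plausible shape for such an objective, assuming that the opacity entropy term is the mean binary entropy of per-primitive opacities (which pushes each opacity toward 0 or 1) and that the terms are combined with the rendering loss as a weighted sum. The function names, weights, and the exact form of each term are illustrative assumptions, not the paper's definitions.

```python
import math

def binary_entropy(alpha, eps=1e-8):
    """Entropy of a Bernoulli variable with probability alpha (assumed form).

    It is largest at alpha = 0.5 and approaches 0 as alpha nears 0 or 1,
    so minimizing it encourages clearly visible or clearly hidden primitives.
    """
    a = min(max(alpha, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(a * math.log(a) + (1.0 - a) * math.log(1.0 - a))

def opacity_entropy_loss(opacities):
    """Mean binary entropy over all opacities (hypothetical regularizer)."""
    return sum(binary_entropy(a) for a in opacities) / len(opacities)

def total_loss(render_loss, reg_terms, weights):
    """Rendering loss plus a weighted sum of named regularization terms.

    reg_terms and weights are dicts keyed by term name, e.g. "coverage",
    "overlap", "simplicity", "opacity_entropy"; the values here are
    placeholders that a real pipeline would compute from the scene.
    """
    return render_loss + sum(weights[name] * value
                             for name, value in reg_terms.items())

# Usage with placeholder numbers: near-binary opacities are penalized less.
confident = opacity_entropy_loss([0.01, 0.99])
uncertain = opacity_entropy_loss([0.4, 0.6])
loss = total_loss(
    render_loss=1.0,
    reg_terms={"opacity_entropy": uncertain},
    weights={"opacity_entropy": 0.1},
)
```

Keeping every regularizer as a named entry with its own weight makes it straightforward to use different weights in the block-level and point-level stages, although whether the paper does so is not stated in this summary.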

Learning Part-aware 3D Representations by Fusing 2D Gaussians and Superquadrics
