Abstract: We propose a novel unsupervised method to learn the pose and part-segmentation of articulated objects with rigid parts. Given two observations of an object in different articulation states, our method learns the geometry and appearance of object parts by using an implicit model from the first observation, and distils the part segmentation and articulation from the second observation while rendering the latter observation. Additionally, to tackle the complexities in the joint optimization of part segmentation and articulation, we propose a voxel grid-based initialization strategy and a decoupled optimization procedure. Compared to the prior unsupervised work, our model obtains significantly better performance, and generalizes to objects with multiple parts while it can be efficiently learned from few views for the latter observation.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to learn, without supervision, the part segmentation and pose of articulated objects with rigid parts from observations in two different articulation states. Specifically, given two sets of observations of an object in different articulation states (each set contains images from multiple viewpoints), the authors propose a method to learn the geometry and appearance of each part and the relative motion between parts.

### Main Problems

1. **Part Segmentation and Pose Estimation**: How to automatically identify each rigid part of an object from unlabeled data and estimate the motion parameters of these parts (such as rotation axes and translations).
2. **Multi-View Synthesis**: How to generate images from new viewpoints by adjusting the poses of parts, starting from a known object model in one articulation state.
3. **Optimization Complexity**: How to handle the complex dependencies between part segmentation and pose estimation so that the model converges stably and produces high-quality results.

### Solutions

To solve the above problems, the authors propose the following methods:

- **Implicit Model and NeRF**: First, use NeRF (Neural Radiance Field) to learn the geometry and appearance of the object from a set of images in a fixed articulation state. Then freeze these parameters so that the geometry and appearance remain unchanged in subsequent steps.
- **Conditional View Synthesis**: By introducing a bottleneck layer, extract the part segmentation and pose changes of parts from a second set of images in a different articulation state. Combine this information with the frozen NeRF to render images in the new articulation state.
- **Voxel Grid Initialization**: To better initialize part segmentation and pose estimation, the authors propose a voxel grid-based strategy that estimates the initial positions of movable parts from the errors of foreground masks.
- **Decoupled Optimization**: Adopt an alternating optimization strategy that optimizes part segmentation and pose parameters separately, avoiding the instability and sensitivity of joint optimization.

### Key Innovation Points

- **Single NeRF Model**: Compared with existing methods, this method only needs to train one NeRF model instead of training a model for each part separately, which reduces the complexity and the number of parameters of the model.
- **Efficient Learning**: Through decoupled optimization and staged training, this method can learn part segmentation and pose from a small number of target views, showing better generalization ability and stability.

### Summary

The main contribution of this paper is an unsupervised method that learns the part segmentation and pose of articulated objects from multi-view images without any labels. By combining conditional view synthesis, voxel grid initialization and decoupled optimization, the authors handle the complex dependencies between part segmentation and pose estimation and achieve high-quality multi-view image synthesis and pose prediction.
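The decoupled (alternating) optimization idea summarized above can be illustrated with a small toy analogue. This is only a sketch under strong assumptions, not the paper's procedure: it replaces the rendering-based loss with known 3D point correspondences between the two articulation states, uses hard part labels, and solves each part's rigid pose in closed form with the Kabsch algorithm. The function names `kabsch`, `alternate` and `make_toy_data`, and all numeric settings, are hypothetical choices made for illustration.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def alternate(src, dst, labels, n_parts, n_iters=10):
    """Alternate between a pose step and a segmentation step (toy analogue)."""
    labels = labels.copy()
    poses = [(np.eye(3), np.zeros(3)) for _ in range(n_parts)]
    for _ in range(n_iters):
        # Pose step: segmentation is held fixed, each part's pose is refit.
        for k in range(n_parts):
            mask = labels == k
            if mask.sum() >= 3:                  # need at least 3 points for a fit
                poses[k] = kabsch(src[mask], dst[mask])
        # Segmentation step: poses are held fixed, each point is assigned
        # to the part whose transform explains its motion best.
        residuals = np.stack(
            [np.linalg.norm(src @ R.T + t - dst, axis=1) for R, t in poses],
            axis=1,
        )
        labels = residuals.argmin(axis=1)
    return labels, poses

def make_toy_data(n_per_part=50, seed=0):
    """Two-part object: part 0 is static, part 1 undergoes a screw motion."""
    rng = np.random.default_rng(seed)
    src = rng.uniform(-1.0, 1.0, size=(2 * n_per_part, 3))
    true_labels = np.repeat([0, 1], n_per_part)
    angle = np.pi / 2                            # 90 degree rotation about z
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle),  np.cos(angle), 0.0],
        [0.0,            0.0,           1.0],
    ])
    t_true = np.array([0.0, 0.0, 1.5])           # translation along the axis
    dst = src.copy()
    moved = true_labels == 1
    dst[moved] = src[moved] @ R_true.T + t_true
    return src, dst, true_labels, R_true, t_true

# Usage: start from a partly wrong segmentation and let the two steps refine it.
src, dst, true_labels, R_true, t_true = make_toy_data()
init_labels = true_labels.copy()
init_labels[:5] = 1                              # mislabel a few static points
init_labels[-5:] = 0                             # mislabel a few moving points
labels, poses = alternate(src, dst, init_labels, n_parts=2)
```

The design point this is meant to convey is that each step is easy when the other quantity is frozen: with labels fixed, each pose is a standard rigid fit, and with poses fixed, segmentation reduces to a per-point comparison of residuals. In the paper's setting the residual would instead come from rendering the second observation, and a reasonable initialization (the role played there by the voxel grid strategy) matters because alternating schemes of this kind can stall when the starting segmentation is poor.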

Articulate your NeRF: Unsupervised articulated object modeling via conditional view synthesis

MPS-NeRF: Generalizable 3D Human Rendering from Multiview Images

Knowledge NeRF: Few-shot Novel View Synthesis for Dynamic Articulated Objects

NARF24: Estimating Articulated Object Structure for Implicit Rendering

CLA-NeRF: Category-Level Articulated Neural Radiance Field

LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

Learning Part Motion of Articulated Objects Using Spatially Continuous Neural Implicit Representations

Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance

SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction

Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs

NeRF-Feat: 6D Object Pose Estimation using Feature Rendering

Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis

Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation

SN 2 eRF: A Framework for Neural Radiance Fields given Sparse and Noisy Poses

Template-free Articulated Neural Point Clouds for Reposable View Synthesis

HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses

SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects

Articulated Motion-Aware NeRF for 3D Dynamic Appearance and Geometry Reconstruction by Implicit Motion States

Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images

CAD-NeRF: Learning NeRFs from Uncalibrated Few-view Images by CAD Model Retrieval