Articulate your NeRF: Unsupervised articulated object modeling via conditional view synthesis

Jianning Deng, Kartic Subr, Hakan Bilen
2024-06-24
Abstract: We propose a novel unsupervised method to learn the pose and part-segmentation of articulated objects with rigid parts. Given two observations of an object in different articulation states, our method learns the geometry and appearance of object parts by using an implicit model from the first observation, and distils the part segmentation and articulation from the second observation while rendering the latter observation. Additionally, to tackle the complexities in the joint optimization of part segmentation and articulation, we propose a voxel grid-based initialization strategy and a decoupled optimization procedure. Compared to the prior unsupervised work, our model obtains significantly better performance, and generalizes to objects with multiple parts, while it can be efficiently learned from few views for the latter observation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to learn the part segmentation and pose of articulated objects with rigid parts, without supervision, from observations of the object in two different articulation states. Specifically, given two sets of observations of an object (each containing images from multiple viewpoints), one per articulation state, the authors propose a method that learns the geometry and appearance of each part, together with the relative motion between parts.

### Main Problems

1. **Part Segmentation and Pose Estimation**: How to automatically identify each rigid part of an object from unlabeled data and estimate the motion parameters of these parts (such as rotation axes and translations).
2. **Multi-View Synthesis**: How to render images from new viewpoints by adjusting the poses of parts, starting from an object model learned in one articulation state.
3. **Optimization Complexity**: How to handle the interdependence between part segmentation and pose estimation so that the model converges stably and produces high-quality results.

### Solutions

The authors propose the following:

- **Implicit Model and NeRF**: First, use a NeRF (Neural Radiance Field) to learn the geometry and appearance of the object from a set of images in one fixed articulation state. Then freeze these parameters so that geometry and appearance remain unchanged in subsequent steps.
- **Conditional View Synthesis**: By introducing a bottleneck layer, extract the part segmentation and pose changes of parts from a second set of images in a different articulation state. Combine this information with the frozen NeRF to render images in the new articulation state.
- **Voxel Grid Initialization**: To better initialize part segmentation and pose estimation, the authors propose a voxel grid-based strategy that estimates the initial positions of movable parts from the errors of foreground masks.
- **Decoupled Optimization**: Adopt an alternating optimization strategy that optimizes part segmentation and pose parameters separately, avoiding the instability and sensitivity of joint optimization.

### Key Innovation Points

- **Single NeRF Model**: Compared with existing methods, this method trains only one NeRF model instead of a separate model for each part, reducing model complexity and the number of parameters.
- **Efficient Learning**: Through decoupled optimization and staged training, the method can learn part segmentation and pose from a small number of target views, with better generalization and stability.

### Summary

The main contribution of this paper is an unsupervised method that learns the part segmentation and pose of articulated objects from multi-view images without any labels. By combining conditional view synthesis, voxel grid initialization, and decoupled optimization, the authors address the interdependence between part segmentation and pose estimation, achieving high-quality multi-view image synthesis and pose prediction.
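As a rough illustration of how a part segmentation and per-part pose changes could be combined, the NumPy sketch below assigns each 3D sample point to a part and moves it with that part's rigid transform. It is not the authors' implementation: the function names `axis_angle_to_matrix` and `warp_points`, the hard `argmax` assignment, and the axis-angle parameterization are all assumptions made for this example. In the actual method, points warped this way would be used to query the frozen NeRF, a step that is omitted here.

```python
import numpy as np


def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def warp_points(points, part_logits, rotations, translations):
    """Move every point with the rigid transform of its most likely part.

    points:       (N, 3) sample positions
    part_logits:  (N, P) per-point part scores (hypothetical segmentation output)
    rotations:    (P, 3, 3) one rotation matrix per part
    translations: (P, 3) one translation per part
    """
    part_ids = np.argmax(part_logits, axis=1)      # hard assignment (assumption)
    R = rotations[part_ids]                        # (N, 3, 3)
    t = translations[part_ids]                     # (N, 3)
    warped = np.einsum("nij,nj->ni", R, points) + t
    return warped, part_ids


# Two parts: part 0 is static, part 1 rotates 90 degrees about the z axis.
rotations = np.stack([np.eye(3), axis_angle_to_matrix([0, 0, 1], np.pi / 2)])
translations = np.zeros((2, 3))

# Two copies of the same point, scored so that each goes to a different part.
points = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
part_logits = np.array([[5.0, -5.0],
                        [-5.0, 5.0]])

warped, part_ids = warp_points(points, part_logits, rotations, translations)
print(part_ids)
print(np.round(warped, 6))
```

In this toy setup the point assigned to the static part stays at (1, 0, 0), while the point assigned to the rotating part ends up at approximately (0, 1, 0). A hard assignment is the simplest choice for a sketch; a differentiable variant would more plausibly blend over soft part probabilities.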