SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects

Jiayi Liu, Denys Iliash, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri
2024-10-30
Abstract: We address the challenge of creating 3D assets for household articulated objects from a single image. Prior work on articulated object creation either requires multi-view multi-state input, or only allows coarse control over the generation process. These limitations hinder the scalability and practicality for articulated object modeling. In this work, we propose a method to generate articulated objects from a single image. Observing the object in resting state from an arbitrary view, our method generates an articulated object that is visually consistent with the input image. To capture the ambiguity in part shape and motion posed by a single view of the object, we design a diffusion model that learns the plausible variations of objects in terms of geometry and kinematics. To tackle the complexity of generating structured data with attributes in multiple domains, we design a pipeline that produces articulated objects from high-level structure to geometric details in a coarse-to-fine manner, where we use a part connectivity graph and part abstraction as proxies. Our experiments show that our method outperforms the state-of-the-art in articulated object creation by a large margin in terms of the generated object realism, resemblance to the input image, and reconstruction quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of generating articulated 3D object models from a single image. Specifically, the authors aim to overcome two limitations of existing methods: they either require multi-view, multi-state input, or they allow only coarse control over the generation process. Both limitations hinder the scalability and practicality of articulated object modeling.

### Problem Background

1. **Importance of Articulated Objects**
   - Articulated objects, such as furniture and household appliances, are common in everyday environments.
   - Creating realistic 3D articulated objects is crucial for constructing virtual environments, especially in fields such as robotics and embodied AI.
2. **Limitations of Existing Methods**
   - Methods that take multi-view or multi-state input require precisely aligned data across views or states, which is not always feasible in practice.
   - Unconditional generation, or generation guided only by high-level constraints, does not give users enough control to specify precisely which object is generated.

### Method Proposed in the Paper

To solve these problems, the paper proposes a new method for generating articulated objects from a single image. Specifically:

- **Input**: An RGB image showing the object in its resting state, from an arbitrary view.
- **Output**: A 3D articulated object model, including:
  - A part connectivity graph, specifying the connections and the motion hierarchy between parts.
  - Joint parameters, describing the joint type, axis, and motion range for each part.
  - The geometry of each articulated part, down to actionable parts (such as handles and knobs).

### Key Points of the Solution

1. **Diffusion Model**
   - Use a diffusion model to learn the plausible variations of objects in terms of geometry and kinematics, in order to handle the ambiguity introduced by a single view.
2. **Hierarchical Generation Pipeline**
   - Design a pipeline that generates articulated objects from coarse to fine, from high-level structure down to geometric details, which keeps the generation process controllable and interpretable.
   - Step 1: Use a large vision-language model to infer the part connectivity graph.
   - Step 2: Generate abstract attributes describing the articulated parts, guided by the connectivity graph and the image.
   - Step 3: Retrieve meshes from a part shape library and assemble them into the final object.
3. **Attention Mechanism**
   - Use a Transformer-based diffusion model that captures the spatial layout of parts through image cross-attention, captures the interaction between part motion and shape through self-attention, and encodes part structure through graph attention.

### Main Contributions

1. **First Exploration**: As far as the authors know, this is the first attempt to generate articulated objects from a single image.
2. **Modular Generation**: A modular, coarse-to-fine generation pipeline that improves user editability and interpretability.
3. **Diffusion Model**: A diffusion model whose generated objects are visually consistent with the input image while still allowing variation to account for the ambiguity in the image.
4. **Performance Verification**: Systematic evaluation on multiple datasets demonstrates better reconstruction quality and generalization ability.

### Conclusion

The paper proposes a method that generates high-quality, plausible articulated object models from a single image, improving the scalability and practicality of articulated object modeling.
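
To make the output representation described above more concrete, here is a minimal Python sketch of a part connectivity graph with per-part joint parameters. It is an illustration only, not the authors' data format: the class names `Joint` and `Part`, the field names, the joint-type strings, the helper functions, and the example cabinet values are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Joint:
    """Joint connecting a part to its parent (field names are hypothetical)."""
    kind: str                          # e.g. "fixed", "revolute", "prismatic" (assumed labels)
    axis_origin: Vec3                  # a point on the joint axis
    axis_direction: Vec3               # direction of the joint axis
    motion_range: Tuple[float, float]  # (min, max): radians for revolute, metres for prismatic


@dataclass
class Part:
    """One node of the part connectivity graph."""
    label: str                         # e.g. "base", "door", "handle" (assumed labels)
    parent: Optional[int]              # index of the parent part, None for the root
    joint: Joint


def children(parts: List[Part], index: int) -> List[int]:
    """Indices of the parts attached directly to parts[index]."""
    return [i for i, p in enumerate(parts) if p.parent == index]


def kinematic_order(parts: List[Part]) -> List[int]:
    """Depth-first order in which every parent precedes its children."""
    roots = [i for i, p in enumerate(parts) if p.parent is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root part")
    order: List[int] = []
    stack = [roots[0]]
    while stack:
        i = stack.pop()
        order.append(i)
        # Push children in reverse so they are visited in index order.
        stack.extend(reversed(children(parts, i)))
    if len(order) != len(parts):
        raise ValueError("graph is disconnected or contains a cycle")
    return order


# Hypothetical example: a cabinet with a hinged door (carrying a handle) and a drawer.
FIXED = Joint("fixed", (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0))
cabinet = [
    Part("base", None, FIXED),
    Part("door", 0, Joint("revolute", (0.4, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.57))),
    Part("handle", 1, FIXED),
    Part("drawer", 0, Joint("prismatic", (0.0, 0.0, 0.2), (0.0, 1.0, 0.0), (0.0, 0.3))),
]

print([cabinet[i].label for i in kinematic_order(cabinet)])
```

Storing a single parent index per part keeps the graph a tree, so the motion hierarchy can be read off directly: moving the door also moves the handle attached to it. In a coarse-to-fine pipeline of the kind summarized above, a structure like this would be filled in first, with part geometry attached in a later step.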