GenZI: Zero-Shot 3D Human-Scene Interaction Generation

Lei Li, Angela Dai
DOI: https://doi.org/10.48550/arXiv.2311.17737
2023-11-29
Abstract: Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene, we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene, guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data, and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality, making it applicable to diverse scene types, including both indoor and outdoor environments.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The problem this paper addresses is: **how can realistic 3D human-scene interactions be generated without using any 3D human-scene interaction data?** Specifically, the authors propose GenZI, a method that synthesizes the pose and shape of a human in a 3D scene from a natural language description and a coarse interaction location, without supervised training on 3D interaction datasets.

### Problem Background

Existing 3D human-scene interaction (HSI) synthesis methods usually rely on large amounts of interaction data carefully captured in real 3D environments. Acquiring this data is difficult and expensive: it requires precise tracking and reconstruction, and the captures must be sufficiently diverse and representative. As a result, existing 3D HSI datasets are limited in scale, which restricts the scenarios these methods can handle.

### Innovations of GenZI

To address these problems, GenZI takes a zero-shot approach to 3D HSI synthesis, consisting of the following steps:

1. **Use of 2D vision-language models (VLMs)**: GenZI uses existing large vision-language models to generate plausible 2D images of humans interacting with the scene. These models have learned rich human-scene semantic relationships from 2D images.
2. **Multi-view rendering and dynamic masking**: Given a 3D scene, a text prompt, and a coarse interaction location, GenZI renders the scene from multiple views and uses a dynamic masking scheme to estimate the inpainting mask automatically, so that the human is inserted into each view in a plausible way.
3. **Robust 3D pose optimization**: GenZI lifts the 2D interaction hypotheses to 3D by optimizing a parametric 3D human model (such as SMPL-X) so that its pose and shape are consistent with the 2D hypotheses.
4. **Iterative refinement**: Repeated rounds of 2D image inpainting and 3D optimization further improve the consistency and realism of the generated 3D human-scene interaction.

### Main Contributions

- **First zero-shot 3D HSI generation**: Without supervision from any 3D interaction data, GenZI can flexibly synthesize interactions for a variety of scenes and actions.
- **Dynamic masking scheme**: Plausible 2D human-scene compositions are generated automatically, without manually specified masks.
- **Robust 3D pose optimization**: View-consistency constraints keep the generated 3D human-scene interaction realistic and consistent across views.

In conclusion, GenZI generates 3D human-scene interactions without relying on captured 3D interaction datasets, is controlled through text prompts, and applies to diverse scene types, including both indoor and outdoor environments.
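
The view-consistency idea behind the robust 3D optimization step can be illustrated with a toy example. The sketch below is not the authors' implementation: GenZI optimizes the parameters of a full body model, whereas this code lifts a single 3D point from 2D observations in several views. It assumes idealized pinhole cameras and uses iteratively reweighted least squares with Cauchy-style weights as a stand-in robust estimator. All names (`look_at_camera`, `project`, `robust_lift`), the camera layout, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def look_at_camera(eye, target, focal=500.0, center=256.0):
    """Build a 3x4 pinhole projection matrix for a camera at `eye` facing `target`."""
    eye, target = np.asarray(eye, float), np.asarray(target, float)
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, [0.0, 0.0, 1.0])  # world up is assumed to be +z
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])        # world-to-camera rotation
    t = -R @ eye
    K = np.array([[focal, 0.0, center], [0.0, focal, center], [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, t[:, None]])

def project(P, X):
    """Project a 3D point X to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def robust_lift(projections, points_2d, sigma=5.0, n_iters=10):
    """Lift per-view 2D points to one 3D point, down-weighting inconsistent views."""
    weights = np.ones(len(projections))
    for _ in range(n_iters):
        rows, rhs = [], []
        for P, (u, v), w in zip(projections, points_2d, weights):
            s = np.sqrt(w)  # weights act on squared residuals
            for coord, row in ((u, P[0]), (v, P[1])):
                eq = s * (coord * P[2] - row)  # linear triangulation constraint
                rows.append(eq[:3])
                rhs.append(-eq[3])
        X, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        # Reprojection error in pixels, one value per view.
        residuals = np.array([np.linalg.norm(project(P, X) - x)
                              for P, x in zip(projections, points_2d)])
        weights = 1.0 / (1.0 + (residuals / sigma) ** 2)
    return X, weights

# Eight hypothetical views on a circle around the interaction location.
target = np.array([0.2, -0.1, 0.9])
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
cameras = [look_at_camera([3.0 * np.cos(a), 3.0 * np.sin(a), 1.6], [0.0, 0.0, 1.0])
           for a in angles]
detections = np.array([project(P, target) for P in cameras])
detections[3] += [80.0, -60.0]  # one view disagrees, like an implausible inpainting

naive, _ = robust_lift(cameras, detections, n_iters=1)  # plain least squares
robust, weights = robust_lift(cameras, detections)
print("naive error (m): ", np.linalg.norm(naive - target))
print("robust error (m):", np.linalg.norm(robust - target))
print("per-view weights:", np.round(weights, 3))
```

In this toy setting the inconsistent view ends up with a weight near zero, so the estimate is driven by the views that agree with each other. The same principle, applied to a full set of body keypoints and body-model parameters, is what allows an optimization of this kind to tolerate 2D hypotheses that conflict across views.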