Any-to-3D Generation via Hybrid Diffusion Supervision

Yijun Fan, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
2024-11-22
Abstract: Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and necessitating retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert text prompts to images and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, resulting in unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind's broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from any modality prompts. Project page: https://zeroooooooow1440.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve the problem of 3D object generation under multimodal conditions. Existing 3D generation models are usually built for specific tasks and can handle only one modality (such as text-to-3D or image-to-3D), and they need to be retrained when the modality changes. This single-modality approach is limiting in practice because real-world data is often multimodal, and each modality provides unique value. In addition, some modalities (such as audio) cannot be directly converted into 3D content, and converting them into text or images before generating 3D content leads to information loss.

To overcome these limitations, the authors propose XBind, a unified framework for generating 3D objects from any modality (including text, image, and audio). XBind combines a multimodal-aligned encoder with a pre-trained diffusion model through cross-modal pre-alignment to achieve 3D generation under multimodal conditions. Specifically, XBind introduces a new loss function, the Modality Similarity (MS) loss, together with hybrid diffusion supervision and a three-stage optimization process, to improve the quality of the generated 3D objects.

### Main contributions

1. **Modality similarity (MS) loss**:
   - A new loss function is proposed that improves the quality of 3D generation results by aligning the embeddings produced by the multimodal-aligned encoder with the embeddings of images rendered from the 3D object.
2. **Three-stage optimization framework**:
   - A coarse-to-fine, three-stage optimization framework is introduced and combined with hybrid diffusion supervision, which significantly improves the visual quality and consistency of any-modality-to-3D generation.
3. **XBind framework**:
   - A pioneering unified framework, XBind, is constructed for 3D generation conditioned on any modality. The experimental results verify the superior performance of XBind.
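
The MS loss described above aligns a prompt embedding with embeddings of rendered views. The summary does not give the exact formula, so the following is only a minimal sketch under one assumed form: the mean cosine distance between the prompt embedding and each rendered-view embedding. The function names `cosine_similarity` and `ms_loss_sketch` are hypothetical, and the embeddings are plain Python lists standing in for encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def ms_loss_sketch(
    prompt_emb: Sequence[float],
    render_embs: Sequence[Sequence[float]],
) -> float:
    """ASSUMED form of an MS-style loss: mean (1 - cosine similarity)
    between the prompt embedding and each rendered-view embedding.
    The paper's actual definition may differ."""
    if not render_embs:
        raise ValueError("at least one rendered-view embedding is required")
    distances = [1.0 - cosine_similarity(prompt_emb, r) for r in render_embs]
    return sum(distances) / len(distances)
```

Under this assumed form, the loss is zero when every rendered view embeds in the same direction as the prompt and grows as the views drift away from it, which is the alignment behavior the summary attributes to the MS loss.
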
### Method overview

1. **Modality similarity (MS) loss**:
   - Aligning the embeddings produced by the multimodal-aligned encoder with the embeddings of images rendered from the 3D object strengthens the guidance that each modality provides during 3D object generation.
2. **Hybrid diffusion supervision**:
   - Pixel-level planar supervision and space-level stereo supervision are combined to form hybrid diffusion supervision. Pixel-level planar supervision includes the consistency distillation sampling (CDS) loss and an enhanced 2D SDS loss; space-level stereo supervision includes a 3D SDS loss and a reference view loss.
3. **Three-stage optimization**:
   - Stage 1: Coarse optimization, using a low-resolution NeRF to generate rough 3D geometry and texture.
   - Stage 2: Geometric refinement, converting the low-resolution NeRF to a high-resolution DMTet and optimizing the geometric details of the 3D object.
   - Stage 3: Texture optimization, fixing the geometry parameters and optimizing the texture details of the 3D object.

Through these innovations, XBind can directly generate high-quality 3D objects from any modality provided by the user, reducing time and resource consumption and alleviating information loss in modality conversion.
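
The structure of the hybrid supervision and the three-stage schedule can be sketched as plain data plus a weighted sum. This is illustrative only: the summary names the four loss terms and the three stages but gives no weights and does not say which parameters are trained in each stage beyond "geometry fixed" in Stage 3. The names `Stage`, `STAGES`, and `hybrid_loss`, the default weight of 1.0, the `optimize_*` flags for Stages 1 and 2, and the Stage 3 representation are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict

# Loss terms named in the summary, grouped by supervision type.
PLANAR_TERMS = ("cds", "sds_2d")              # pixel-level planar supervision
STEREO_TERMS = ("sds_3d", "reference_view")   # space-level stereo supervision


@dataclass(frozen=True)
class Stage:
    name: str
    representation: str
    optimize_geometry: bool  # ASSUMPTION except for stage 3 (geometry fixed)
    optimize_texture: bool   # ASSUMPTION for stages 1 and 2


STAGES = (
    Stage("coarse", "low-resolution NeRF", True, True),
    Stage("geometry_refinement", "high-resolution DMTet", True, False),
    # ASSUMPTION: stage 3 keeps the DMTet representation from stage 2.
    Stage("texture_optimization", "high-resolution DMTet", False, True),
)


def hybrid_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the planar and stereo loss terms.

    `terms` maps each loss name to its scalar value for the current step;
    `weights` maps loss names to coefficients (missing names default to 1.0,
    an illustrative choice, not a value from the paper).
    """
    expected = PLANAR_TERMS + STEREO_TERMS
    missing = [name for name in expected if name not in terms]
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    return sum(weights.get(name, 1.0) * terms[name] for name in expected)


if __name__ == "__main__":
    step_terms = {"cds": 1.0, "sds_2d": 2.0, "sds_3d": 3.0, "reference_view": 4.0}
    for stage in STAGES:
        total = hybrid_loss(step_terms, {"sds_3d": 0.5})
        print(stage.name, stage.representation, total)
```

In a real pipeline each term would be a differentiable tensor computed from rendered views, and the weights would likely vary per stage; the sketch only shows how the two supervision groups feed a single objective.
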