Abstract: Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and requiring retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert the text prompt to an image and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, and it incurs unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind's broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from prompts of any modality. Project page: https://zeroooooooow1440.github.io/

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the problem of 3D object generation under multimodal conditions. Existing 3D generation models are usually built for specific tasks, handle only one modality (such as text-to-3D or image-to-3D), and need to be retrained when the modality changes. This single-modality approach is limiting in practice because real-world data is often multimodal, and each modality provides unique value. In addition, some modalities (such as audio) cannot be directly converted into 3D content, and converting them into text or images before generating 3D content leads to information loss.

To overcome these limitations, the authors propose XBind, a unified framework for generating 3D objects from any modality (including text, image, and audio). XBind integrates a multimodal-aligned encoder with a pre-trained diffusion model through cross-modal pre-alignment to achieve 3D generation under multimodal conditions. Specifically, XBind introduces a new loss function, the modality similarity (MS) loss, as well as hybrid diffusion supervision and a three-stage optimization process to improve the quality of the generated 3D objects.

### Main contributions

1. **Modality similarity (MS) loss**: A new loss function that improves the quality of 3D generation results by aligning the embeddings produced by the multimodal-aligned encoder with the embeddings of images rendered from the 3D object.
2. **Three-stage optimization framework**: A coarse-to-fine three-stage optimization framework, combined with hybrid diffusion supervision, that significantly improves the visual quality and consistency of any-modality-to-3D generation.
3. **XBind framework**: A pioneering unified framework for 3D generation conditioned on any modality. The experimental results verify the superior performance of XBind.
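The MS loss described in the first contribution can be illustrated with a small sketch. The summary does not give the exact formula, so the cosine-distance form below is an assumption, as are the function names `cosine_similarity` and `ms_loss`. The plain Python lists stand in for embeddings that a multimodal-aligned encoder would produce for the prompt and for each rendered view.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def ms_loss(prompt_embedding, rendered_embeddings):
    """Hypothetical MS loss: mean cosine distance between the prompt
    embedding and the embedding of each rendered view.

    Assumption: the loss is 1 - cosine similarity, averaged over views.
    The paper's actual definition may differ.
    """
    if not rendered_embeddings:
        raise ValueError("at least one rendered view is required")
    distances = [
        1.0 - cosine_similarity(prompt_embedding, view)
        for view in rendered_embeddings
    ]
    return sum(distances) / len(distances)
```

Under this assumed form, the loss is 0 when every rendered view embeds in the same direction as the prompt and grows toward 2 as the embeddings point in opposite directions, so minimizing it pulls the rendered images toward the prompt in the shared embedding space.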
### Method overview

1. **Modality similarity (MS) loss**: Aligning the embeddings produced by the multimodal-aligned encoder with the embeddings of images rendered from the 3D object strengthens the guidance that each modality provides during 3D generation.
2. **Hybrid diffusion supervision**: Pixel-level planar supervision and space-level stereo supervision are combined to form hybrid diffusion supervision. Pixel-level planar supervision includes the consistency distillation sampling (CDS) loss and an enhanced 2D SDS loss; space-level stereo supervision includes the 3D SDS loss and a reference view loss.
3. **Three-stage optimization**:
   - Stage 1: Coarse optimization, using a low-resolution NeRF to generate rough 3D geometry and texture.
   - Stage 2: Geometric refinement, converting the low-resolution NeRF to a high-resolution DMTet and optimizing the geometric details of the 3D object.
   - Stage 3: Texture optimization, fixing the geometry parameters and optimizing the texture details of the 3D object.

Through these innovations, XBind can directly generate high-quality 3D objects from any modality provided by the user, reducing time and resource consumption and alleviating information loss in modality conversion.
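The method overview above can be summarized as a small data structure. This is only a sketch: the summary names the four loss terms and the three stages but gives no weights and does not say which terms are active in which stage, so the `Stage` class, the term keys, the weighted-sum form of `hybrid_loss`, and the flags for stage 2 texture updates and the stage 3 representation are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    representation: str
    optimize_geometry: bool
    optimize_texture: bool


# Stages as listed in the summary. Whether stage 2 also updates texture,
# and which representation stage 3 uses, are assumptions.
STAGES = (
    Stage("coarse optimization", "low-resolution NeRF", True, True),
    Stage("geometric refinement", "high-resolution DMTet", True, False),
    Stage("texture optimization", "high-resolution DMTet", False, True),
)

# The four supervision terms named in the summary, grouped by level.
PLANAR_TERMS = ("cds", "sds_2d")             # pixel-level planar supervision
STEREO_TERMS = ("sds_3d", "reference_view")  # space-level stereo supervision


def hybrid_loss(terms, weights):
    """Hypothetical combination: weighted sum of the four named terms.

    `terms` and `weights` map each term name to a float. The weighted-sum
    form and the weights themselves are illustrative assumptions.
    """
    names = PLANAR_TERMS + STEREO_TERMS
    missing = [n for n in names if n not in terms or n not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[n] * terms[n] for n in names)
```

A training loop built on this sketch would walk through `STAGES` in order, freeze geometry or texture parameters according to the flags, and minimize `hybrid_loss` (plus the MS loss) at each step.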

Any-to-3D Generation via Hybrid Diffusion Supervision
