Abstract: Significant progress has been made in training large generative models for natural language and images. Yet the advancement of 3D generative models is hindered by their substantial resource demands for training, along with inefficient, non-compact, and less expressive representations. This paper introduces Make-A-Shape, a new 3D generative model designed for efficient training at a vast scale, capable of utilizing 10 million publicly available shapes. Technically, we first introduce a wavelet-tree representation that compactly encodes shapes, formulating a subband coefficient filtering scheme to efficiently exploit coefficient relations. We then make the representation generatable by a diffusion model by devising a subband coefficient packing scheme that lays out the representation in a low-resolution grid. Further, we derive a subband adaptive training strategy so that our model effectively learns to generate both coarse and detail wavelet coefficients. Finally, we extend our framework to be controlled by additional input conditions, enabling it to generate shapes from assorted modalities, e.g., single/multi-view images, point clouds, and low-resolution voxels. In our extensive set of experiments, we demonstrate various applications, such as unconditional generation, shape completion, and conditional generation on a wide range of modalities. Our approach not only surpasses the state of the art in delivering high-quality results but also generates shapes efficiently, within a few seconds and in just 2 seconds for most conditions. Our source code is available at https://github.com/AutodeskAILab/Make-a-Shape.

CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

CLIPtortionist: Zero-shot Text-driven Deformation for Manufactured 3D Shapes

Zero-Shot Text-to-Image Generation

Text-Free Controllable 3-D Point Cloud Generation

ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation

Zero3D: Semantic-Driven 3D Shape Generation for Zero-Shot Learning

MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

Make-A-Shape: a Ten-Million-scale 3D Shape Model

AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

Text-to-3D Shape Generation

ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model

Neural Shape Compiler: A Unified Framework for Transforming between Text, Point Cloud, and Program

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation