Abstract: This paper aims to design a unified Computer-Aided Design (CAD) generation system that can generate CAD models from user inputs in the form of textual descriptions, images, point clouds, or any combination of them. Toward this goal, we introduce CAD-MLLM, the first system capable of generating parametric CAD models conditioned on multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multimodal data and the vectorized representations of CAD models. To facilitate model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains a textual description, multi-view images, points, and a command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noise and missing points. The project page and more visualizations can be found at: https://cad-mllm.github.io/
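The abstract does not spell out how a command sequence is vectorized for the language model. As a rough illustration only, the sketch below shows one plausible way to turn a sketch-and-extrude style command list into a flat text string that an LLM tokenizer could consume. The command names (`LINE`, `EXTRUDE`), the 256-level quantization, the `[-1, 1]` coordinate range, and the `;` separator are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Command:
    """One CAD construction step (hypothetical schema, not the paper's)."""
    name: str
    params: Tuple[float, ...]


def quantize(value: float, lo: float = -1.0, hi: float = 1.0, levels: int = 256) -> int:
    """Map a continuous parameter in [lo, hi] to an integer bin in [0, levels - 1]."""
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * (levels - 1))


def serialize(commands: Sequence[Command]) -> str:
    """Flatten a command sequence into a text string of names and integer bins."""
    chunks = []
    for cmd in commands:
        parts = [cmd.name] + [str(quantize(p)) for p in cmd.params]
        chunks.append(" ".join(parts))
    return " ; ".join(chunks)


# Hypothetical two-step model: one line segment endpoint pair, then an extrusion depth.
example = [Command("LINE", (-1.0, 1.0)), Command("EXTRUDE", (0.5,))]
text = serialize(example)
```

Quantizing parameters to a small integer vocabulary keeps the target sequence short and discrete, which is a common design choice when a language model has to emit geometry; whether CAD-MLLM does exactly this is not stated in the abstract.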

What problem does this paper attempt to address?

This paper aims to design a unified computer-aided design (CAD) generation system that can generate CAD models from user inputs such as text descriptions, images, point clouds, or any combination of them. Specifically, the authors introduce CAD-MLLM, the first system capable of generating parametric CAD models under multimodal conditions. Within the CAD-MLLM framework, the authors use the command sequences of CAD models and adopt advanced large language models (LLMs) to align the feature spaces of these different modalities with the vectorized representations of CAD models.

To facilitate model training, the authors design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. The resulting dataset, named Omni-CAD, is the first multimodal CAD dataset containing text descriptions, multi-view images, point clouds, and construction command sequences for each CAD model. It includes approximately 450,000 instances and their CAD construction sequences.

The main contributions of the paper include:

1. Proposing a unified multimodal-conditioned CAD generation method based on pre-trained multimodal large language models (MLLMs), supporting text, images, point clouds, and any combination of them as conditional inputs.
2. Creating a large-scale dataset, Omni-CAD, the first multimodal CAD dataset containing construction command sequences together with corresponding text descriptions, multi-view images, and point cloud data.
3. Introducing four new evaluation metrics, namely Segment Error (SegE), Dangling Edge Length (DangEL), Self-Intersection Ratio (SIR), and Flux Enclosure Error (FluxEE), for evaluating the topological quality and surface enclosure of the generated CAD models.
4. Showing through extensive experiments that the method achieves state-of-the-art performance compared with baseline methods and remains robust at inference time to data defects such as noise and missing points.
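The summary names the topology and enclosure metrics without defining them. To give a feel for what such metrics can measure, the sketch below computes two simple quantities on a triangle mesh: the total length of edges used by only one face, and the flux of a constant vector field through the surface, which is zero for any closed, consistently oriented surface by the divergence theorem. These are illustrative analogues under the assumption of a triangle-mesh input. They are not the paper's definitions of DangEL or FluxEE, and the function names are invented for this example.

```python
import math
from collections import Counter


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def single_face_edge_length(vertices, faces):
    """Total length of edges that belong to exactly one triangle (open boundary)."""
    counts = Counter()
    for face in faces:
        for i in range(3):
            a, b = face[i], face[(i + 1) % 3]
            counts[(min(a, b), max(a, b))] += 1
    return sum(math.dist(vertices[a], vertices[b])
               for (a, b), n in counts.items() if n == 1)


def constant_field_flux(vertices, faces, field=(0.0, 0.0, 1.0)):
    """Flux of a constant vector field through the oriented triangle mesh."""
    total = 0.0
    for i, j, k in faces:
        # The cross product of two edge vectors has length 2 * triangle area
        # and points along the face normal given by the vertex winding.
        n = _cross(_sub(vertices[j], vertices[i]), _sub(vertices[k], vertices[i]))
        total += 0.5 * (n[0] * field[0] + n[1] * field[1] + n[2] * field[2])
    return total


# A unit tetrahedron with outward-facing winding, and the same mesh with its base removed.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
closed_faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
open_faces = closed_faces[1:]
```

On the closed tetrahedron both quantities are zero; removing the base leaves three single-face edges and a nonzero flux, which is the kind of defect a reconstruction-only metric such as Chamfer distance may not penalize strongly.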

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
