CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, Shenghua Gao
2024-11-08
Abstract: This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user's inputs in the form of textual descriptions, images, point clouds, or even a combination of them. Towards this goal, we introduce CAD-MLLM, the first system capable of generating parametric CAD models conditioned on multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multimodal data and the CAD models' vectorized representations. To facilitate model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains a textual description, multi-view images, points, and a command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noise and missing points. The project page and more visualizations can be found at: https://cad-mllm.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to design a unified computer-aided design (CAD) generation system that can generate CAD models from user inputs such as text descriptions, images, point clouds, or any combination of them. Specifically, the authors introduce CAD-MLLM, the first system capable of generating parametric CAD models under multimodal conditions. Within the CAD-MLLM framework, the authors use the command sequences of CAD models and adopt large language models (LLMs) to align the feature spaces of the different input modalities with the vectorized representations of CAD models. To facilitate model training, the authors design a data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. The resulting dataset, named Omni-CAD, is the first multimodal CAD dataset containing a text description, multi-view images, a point cloud, and a construction command sequence for each CAD model. It includes approximately 450,000 instances and their CAD construction sequences.

The main contributions of the paper include:
1. Proposing a unified multimodal-conditioned CAD generation method based on pre-trained multimodal large language models (MLLMs), supporting text, images, point clouds, and any combination of them as conditional inputs.
2. Creating a large-scale dataset, Omni-CAD, the first multimodal CAD dataset containing construction command sequences together with corresponding text descriptions, multi-view images, and point cloud data.
3. Introducing four new evaluation metrics, namely Segment Error (SegE), Dangling Edge Length (DangEL), Self-Intersection Ratio (SIR), and Flux Enclosure Error (FluxEE), for evaluating the topological quality and surface enclosure of the generated CAD models.
4. Conducting extensive experiments showing that the method outperforms baseline methods and remains robust at inference time when the input data is defective, for example noisy or missing points.
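
This summary names the topology and enclosure metrics but does not give their formulas. As a rough illustration of the kind of quantity they might capture, the sketch below works on a plain triangle mesh and makes two assumptions that are not taken from the paper: that a "dangling" edge is one belonging to exactly one triangle, and that surface enclosure can be probed by the flux of a constant vector field, which is zero for a closed, consistently oriented surface by the divergence theorem. The function names `dangling_edge_length` and `constant_field_flux` are hypothetical, and the paper's actual DangEL and FluxEE definitions may differ.

```python
from collections import Counter
from math import dist


def dangling_edge_length(vertices, faces):
    """Total length of edges used by exactly one triangle (assumed definition)."""
    edge_use = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_use[(min(u, v), max(u, v))] += 1
    return sum(
        dist(vertices[u], vertices[v])
        for (u, v), count in edge_use.items()
        if count == 1
    )


def constant_field_flux(vertices, faces, field=(1.0, 0.0, 0.0)):
    """Flux of a constant field through the oriented triangles.

    For a closed, consistently oriented surface this is zero, so a
    non-zero value hints at a hole (assumed proxy, not the paper's FluxEE).
    """
    total = 0.0
    for a, b, c in faces:
        p, q, r = vertices[a], vertices[b], vertices[c]
        e1 = [q[i] - p[i] for i in range(3)]
        e2 = [r[i] - p[i] for i in range(3)]
        # Cross product e1 x e2; half of it is the area-weighted normal.
        normal = (
            e1[1] * e2[2] - e1[2] * e2[1],
            e1[2] * e2[0] - e1[0] * e2[2],
            e1[0] * e2[1] - e1[1] * e2[0],
        )
        total += 0.5 * sum(f * n for f, n in zip(field, normal))
    return total


# Unit tetrahedron with outward-facing triangles.
VERTS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
CLOSED = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
OPEN = CLOSED[:3]  # drop the slanted face to leave a hole

if __name__ == "__main__":
    print(dangling_edge_length(VERTS, CLOSED), constant_field_flux(VERTS, CLOSED))
    print(dangling_edge_length(VERTS, OPEN), constant_field_flux(VERTS, OPEN))
```

On the closed tetrahedron both quantities vanish; removing one face exposes three boundary edges and leaves a non-zero flux, which is the behavior an enclosure-style metric would be expected to penalize.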