LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng
2024-11-15
Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses the problem of extending large language models (LLMs) to generate 3D meshes. The authors propose **LLaMA-Mesh**, a method that represents 3D mesh data as plain text so that pre-trained language models can process and generate 3D meshes directly. Its main objectives are:

1. **Utilizing the spatial knowledge of pre-trained language models**: Leveraging the spatial knowledge that LLMs have already acquired from textual sources (such as 3D tutorials) to generate 3D meshes.
2. **Achieving interactive 3D generation**: Allowing users to provide text prompts through a conversational interface, with the model responding with both text and 3D mesh outputs, thereby enabling interactive 3D content creation.
3. **Unifying language and 3D modalities**: Handling both modalities within a single model, so that LLMs can generate, understand, and interpret 3D meshes.

### Main Challenges

- **Tokenization of 3D mesh data**: Effectively tokenizing 3D mesh data into discrete tokens so that LLMs can process it seamlessly.
- **Maintaining language generation performance**: Ensuring that the original language generation capabilities of LLMs are not compromised while extending their functionality.

### Solutions

- **Text representation of 3D meshes**: Representing the vertex coordinates and face definitions of 3D meshes as plain text (such as OBJ files), thereby avoiding modifications to the tokenizer or vocabulary.
- **Supervised fine-tuning (SFT) dataset**: Constructing a dataset containing text-3D pairs and interleaved text-3D dialogues for fine-tuning pre-trained LLMs.
- **Quantization of vertex coordinates**: Quantizing vertex coordinates into a fixed number of intervals, which shortens the token sequence for each mesh and makes long meshes easier for the model to handle.
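
The text representation and quantization steps described above can be illustrated with a minimal sketch. This is not the authors' code: the function names `quantize_vertices` and `mesh_to_obj_text`, the choice of 64 bins, and the use of a single shared coordinate range for normalization are all assumptions made for illustration.

```python
def quantize_vertices(vertices, n_bins=64):
    """Map float coordinates to integers in [0, n_bins - 1].

    Assumption: all three axes share one min/max range, so the
    mesh keeps its proportions after quantization.
    """
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    span = (hi - lo) or 1.0  # avoid division by zero for degenerate meshes
    return [
        tuple(min(n_bins - 1, int((c - lo) / span * n_bins)) for c in v)
        for v in vertices
    ]


def mesh_to_obj_text(vertices, faces, n_bins=64):
    """Serialize a mesh as OBJ-style plain text with integer coordinates."""
    lines = [f"v {x} {y} {z}" for x, y, z in quantize_vertices(vertices, n_bins)]
    # OBJ face indices are 1-based, so shift the 0-based indices used here.
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines)


# A unit square made of two triangles (hypothetical example data).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
print(mesh_to_obj_text(vertices, faces))
```

Because each coordinate becomes a short integer rather than a long decimal string, every vertex line takes fewer characters, which is the motivation for quantization given in the summary. The resulting string is ordinary text, so it can be passed to an unmodified tokenizer.
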
### Experimental Results

- **High-quality and diverse 3D mesh generation**: The quality of 3D meshes generated by LLaMA-Mesh is comparable to that of models trained from scratch, and the model can generate a variety of different shapes.
- **Retention of language and conversational abilities**: The fine-tuned model retains strong language understanding and generation capabilities, providing coherent and contextually appropriate responses in conversations.
- **Training efficiency**: Compared to specialized 3D mesh generation models, LLaMA-Mesh is more efficient in terms of training time and computational resources.

### Conclusion

LLaMA-Mesh successfully extends the capabilities of LLMs to the field of 3D content generation, achieving a seamless unification of language and 3D modalities, and opening up new possibilities for 3D content creation.