Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the problem of extending large language models (LLMs) to generate 3D meshes. Specifically, the authors propose **LLaMA-Mesh**, a method that represents 3D mesh data as plain text so that pretrained language models can process and generate 3D meshes directly. The main objectives are:

1. **Utilizing the spatial knowledge of pre-trained language models**: Generating 3D meshes by drawing on the spatial knowledge that LLMs have already acquired from text sources (such as 3D tutorials).
2. **Achieving interactive 3D generation**: Allowing users to provide text prompts through a conversational interface, with the model responding with both text and 3D mesh outputs, thereby enabling interactive 3D content creation.
3. **Seamlessly unifying language and 3D modalities**: Combining both modalities within a single model, so that the LLM can generate, understand, and interpret 3D meshes.

### Main Challenges

- **Tokenization of 3D mesh data**: Effectively tokenizing 3D mesh data into discrete tokens so that LLMs can process it seamlessly.
- **Maintaining language generation performance**: Ensuring that the original language generation capabilities of LLMs are not compromised while extending their functionality.

### Solutions

- **Text representation of 3D meshes**: Representing the vertex coordinates and face definitions of 3D meshes in plain text (such as the OBJ file format), thereby avoiding modifications to the tokenizer or vocabulary.
- **Supervised fine-tuning (SFT) dataset**: Constructing a dataset containing text-3D pairs and interleaved text-3D dialogues for fine-tuning pre-trained LLMs.
- **Quantization of vertex coordinates**: Quantizing vertex coordinates into a fixed number of intervals, which shortens the token sequence for each mesh and makes long meshes easier for the model to handle.
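
The text representation and coordinate quantization described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`quantize_vertices`, `mesh_to_obj_text`, `obj_text_to_mesh`), the default of 64 bins, and the normalization scheme (shift each axis to zero, scale all axes by the largest extent) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
QVertex = Tuple[int, int, int]
Face = Tuple[int, int, int]  # 0-based vertex indices


def quantize_vertices(vertices: Sequence[Vertex], n_bins: int = 64) -> List[QVertex]:
    """Snap float coordinates to integers in [0, n_bins - 1].

    Assumption: each axis is shifted to start at 0 and all axes share one
    scale (the largest extent), so the shape's proportions are preserved.
    """
    mins = [min(v[i] for v in vertices) for i in range(3)]
    # Fall back to 1.0 for a degenerate mesh whose extent is zero.
    extent = max(max(v[i] for v in vertices) - mins[i] for i in range(3)) or 1.0
    return [
        tuple(round((v[i] - mins[i]) / extent * (n_bins - 1)) for i in range(3))
        for v in vertices
    ]


def mesh_to_obj_text(
    vertices: Sequence[Vertex], faces: Sequence[Face], n_bins: int = 64
) -> str:
    """Serialize a triangle mesh as OBJ-style text with integer coordinates."""
    quantized = quantize_vertices(vertices, n_bins)
    lines = [f"v {x} {y} {z}" for x, y, z in quantized]
    # OBJ face indices are 1-based.
    lines += [f"f {a + 1} {b + 1} {c + 1}" for a, b, c in faces]
    return "\n".join(lines)


def obj_text_to_mesh(text: str) -> Tuple[List[QVertex], List[Face]]:
    """Parse integer-quantized OBJ-style text back into vertices and faces.

    Lines that are not 'v x y z' or 'f a b c' are skipped, which lets the
    parser tolerate surrounding conversational text.
    """
    vertices: List[QVertex] = []
    faces: List[Face] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        if parts[0] == "v":
            vertices.append(tuple(int(p) for p in parts[1:]))
        elif parts[0] == "f":
            faces.append(tuple(int(p) - 1 for p in parts[1:]))
    return vertices, faces


if __name__ == "__main__":
    tetra_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    tetra_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
    print(mesh_to_obj_text(tetra_vertices, tetra_faces))
```

Because every coordinate becomes a short integer rather than a long decimal string, each vertex line costs fewer tokens, which is the motivation the summary gives for quantization. The output is ordinary text, so it can be placed in a chat response without any change to the tokenizer.
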
### Experimental Results

- **High-quality and diverse 3D mesh generation**: The quality of 3D meshes generated by LLaMA-Mesh is comparable to that of models trained from scratch, and the model can generate a variety of different shapes.
- **Retention of language and conversational abilities**: The fine-tuned model retains strong language understanding and generation capabilities, providing coherent and contextually appropriate responses in conversations.
- **Training efficiency**: Compared to specialized 3D mesh generation models, LLaMA-Mesh is more efficient in terms of training time and computational resources.

### Conclusion

LLaMA-Mesh successfully extends the capabilities of LLMs to the field of 3D content generation, achieving a seamless unification of language and 3D modalities, and opening up new possibilities for 3D content creation.

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
