LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu
2024-09-27
Abstract: Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the limited ability of existing Large Multimodal Models (LMMs) to understand 3D scenes. Although LMMs have made significant progress on 2D vision tasks, they face two challenges when dealing with 3D scenes:

1. **Lack of large-scale 3D vision-language datasets**: 3D datasets are scarce compared with the abundant 2D data, which limits what the model can learn.
2. **Absence of powerful 3D encoders**: The 3D field does not yet have a pretrained encoder comparable to CLIP ViT in 2D, which makes high-quality 3D features difficult to extract.

To address these problems, the authors propose a framework named LLaVA-3D. Its core innovations are:

- **3D Patch representation**: 2D image patch features are combined with 3D position information to form a new 3D representation, giving the model 3D spatial awareness.
- **Efficient 3D pooling strategy**: To reduce computational overhead while preserving key information, the authors explore two parameter-free pooling strategies: voxelization pooling and farthest point sampling (FPS) pooling.
- **Unified architecture design**: By jointly tuning on 2D and 3D vision-language instructions, LLaVA-3D performs well on 3D tasks while retaining the original 2D image understanding and reasoning ability.

Specifically, LLaVA-3D proceeds through the following steps:

1. **Constructing 3D Patches**:
   - Extract 2D image patch features \( X_v' \in \mathbb{R}^{V \times c \times w \times h} \) from multi-view images.
   - Use the camera's intrinsic and extrinsic parameters to obtain the positions \( P \in \mathbb{R}^{V \times 3 \times w \times h} \) of these image patches in the 3D world.
   - Encode the 3D positions into position embeddings \( P' \in \mathbb{R}^{V \times w \times h \times d} \), pass them through a two-layer MLP, and add the result to the 2D image patch features to form 3D Patches \( X_{3D}' = X_v' + \text{MLP}(P') \).
2. **3D Pooling**:
   - Use voxelization pooling or farthest point sampling (FPS) pooling to reduce the number of 3D Patches, so that the model can efficiently process a large number of input images.
3. **Position encoding and decoding**:
   - At the input end, use 3D Coordinate Tokens to process language instructions that contain 3D coordinates.
   - At the output end, use special location tokens to guide the model to generate accurate 3D bounding boxes.
4. **Training process**:
   - **Stage 1: 3D Patch-language alignment**: Train on region-level and scene-level caption data that describe the spatial relationships between 3D objects. The visual encoder and LLM parameters are frozen; only the projection layer and the 3D position embedding layer are trained.
   - **Stage 2: Task-instruction tuning**: Use a mixed 2D and 3D dataset (LLaVA-3D-Instruct-1M) for instruction tuning, improving the model's performance on complex 3D V&L tasks while maintaining its 2D image reasoning and instruction-following abilities.

The experimental results show that LLaVA-3D achieves state-of-the-art performance on multiple 3D benchmarks, and that its performance on 2D tasks is comparable to that of the original LLaVA, supporting its effectiveness as a general-purpose model.
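To make the 3D Patch construction step described above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a pinhole camera, a per-patch depth value (the summary mentions only camera parameters, so depth is an assumption here), and a 4x4 camera-to-world matrix. The position embedding is modeled as a small two-layer ReLU MLP applied directly to the xyz coordinates, which is one possible reading of the formula. The names `backproject`, `mlp2`, `init_mlp`, and `make_3d_patch` are hypothetical.

```python
import random


def backproject(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Lift pixel (u, v) with a depth value to a world-space 3D point.

    Assumes a pinhole camera; cam_to_world is a 4x4 row-major nested list.
    """
    # Intrinsics: pixel coordinates -> camera-frame coordinates.
    xc = (u - cx) / fx * depth
    yc = (v - cy) / fy * depth
    pc = (xc, yc, depth, 1.0)
    # Extrinsics: camera frame -> world frame (first three rows only).
    return tuple(sum(row[k] * pc[k] for k in range(4)) for row in cam_to_world[:3])


def mlp2(x, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU in between; weights are lists of rows."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for illustration only (a real model would learn these)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, [0.0] * hidden_dim, w2, [0.0] * out_dim


def make_3d_patch(feature, position, params):
    """3D patch = 2D patch feature + MLP(3D position), added elementwise."""
    emb = mlp2(position, *params)
    if len(emb) != len(feature):
        raise ValueError("MLP output size must match the feature size")
    return [f + e for f, e in zip(feature, emb)]


# Toy example: a camera translated by (1, 2, 3) with no rotation.
cam_to_world = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3], [0, 0, 0, 1]]
position = backproject(2.0, 4.0, 2.0, fx=2.0, fy=2.0, cx=0.0, cy=0.0,
                       cam_to_world=cam_to_world)
print(position)  # (3.0, 6.0, 5.0)

feature = [0.5, -1.0, 2.0, 0.0]           # stand-in for one CLIP patch feature
params = init_mlp(3, 8, len(feature), random.Random(0))
patch = make_3d_patch(feature, position, params)
print(len(patch))  # 4
```

Because the position term is simply added to the feature, the patch keeps the dimensionality of the 2D feature, which is consistent with the summary's point that the 2D model can be reused without a separate 3D encoder.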
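The two parameter-free pooling strategies named in the 3D Pooling step can be sketched as follows. This is an illustrative, dependency-free Python sketch under stated assumptions rather than the paper's code: voxelization pooling is assumed to average the features of all patches that fall into the same voxel, and FPS uses the standard greedy selection. The function names and the `voxel_size`, `k`, and `start` parameters are hypothetical.

```python
import math
from collections import defaultdict


def voxel_pool(positions, features, voxel_size):
    """Average the features of all patches that fall into the same voxel.

    Returns a dict mapping integer voxel indices (i, j, k) to mean features.
    Averaging is an assumption; the text only says the pooling is parameter-free.
    """
    buckets = defaultdict(list)
    for pos, feat in zip(positions, features):
        key = tuple(math.floor(c / voxel_size) for c in pos)
        buckets[key].append(feat)
    return {key: [sum(col) / len(feats) for col in zip(*feats)]
            for key, feats in buckets.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    Returns the indices of the k selected points, beginning with `start`.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_d2[i] is the squared distance from point i to its nearest selected point.
    min_d2 = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(n), key=min_d2.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d2 = squared_distance(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return selected


# Toy data: three patches, two of which share the voxel (0, 0, 0).
positions = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.5, 0.0, 0.0)]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = voxel_pool(positions, features, voxel_size=1.0)
print(pooled)  # {(0, 0, 0): [2.0, 3.0], (1, 0, 0): [5.0, 6.0]}

# Four points on a line; FPS starting from index 0 picks the far end next.
line = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (4, 0, 0)]
picked = farthest_point_sampling(line, k=3)
print(picked)  # [0, 2, 3]
```

The two strategies trade off differently: voxelization yields a token count that depends on how much of the scene is occupied, while FPS returns exactly `k` tokens regardless of scene size.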