Abstract: Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

What problem does this paper attempt to address?

The main problem this paper addresses is the limited ability of existing Large Multimodal Models (LMMs) to understand 3D scenes. Although LMMs have made significant progress on 2D vision tasks, they face two challenges in 3D:

1. **Lack of large-scale 3D vision-language datasets**: Compared with abundant 2D data, 3D datasets are scarce, which limits what the model can learn.
2. **Absence of powerful 3D encoders**: The 3D field does not yet have a pretrained encoder comparable to CLIP ViT in 2D, which makes it difficult to extract high-quality 3D features.

To address these problems, the authors propose a framework named LLaVA-3D. Its core innovations are:

- **3D Patch representation**: 2D image patch features are combined with 3D position information to form a new 3D representation, giving the model 3D spatial awareness.
- **Efficient 3D pooling strategy**: To reduce computational overhead while preserving key information, the authors explore two parameter-free pooling strategies: voxelization pooling and farthest point sampling (FPS) pooling.
- **Unified architecture design**: By jointly tuning on 2D and 3D vision-language instructions, LLaVA-3D performs well on 3D tasks while retaining the original 2D image understanding and reasoning ability.

Specifically, LLaVA-3D works as follows:

1. **Constructing 3D Patches**:
   - Extract 2D image patch features \( X_v' \in \mathbb{R}^{V \times c \times w \times h} \) from multi-view images.
   - Use the camera intrinsic and extrinsic parameters to obtain the positions \( P \in \mathbb{R}^{V \times 3 \times w \times h} \) of these image patches in the 3D world.
   - Encode the 3D positions into position embeddings \( P' \in \mathbb{R}^{V \times w \times h \times d} \), pass them through a two-layer MLP, and add the result to the 2D image patch features to form the 3D Patches \( X_{3D}' = X_v' + \text{MLP}(P') \).
2. **3D Pooling**:
   - Use voxelization pooling or farthest point sampling (FPS) pooling to reduce the number of 3D Patches, so that the model can efficiently process a large number of input images.
3. **Position encoding and decoding**:
   - At the input end, use 3D Coordinate Tokens to process language instructions that contain 3D coordinates.
   - At the output end, use special location tokens to guide the model to generate accurate 3D bounding boxes.
4. **Training process**:
   - **Stage 1: 3D Patch-language alignment**: Train on region-level and scene-level caption data that describe the spatial relationships between 3D objects. The visual encoder and LLM parameters are frozen, and only the projection layer and the 3D position embedding layer are trained.
   - **Stage 2: Task-instruction tuning**: Use a mixed 2D and 3D dataset (LLaVA-3D-Instruct-1M) for instruction tuning, improving the model's performance on complex 3D V&L tasks while maintaining its 2D image reasoning and instruction-following abilities.

The experimental results show that LLaVA-3D achieves state-of-the-art performance on multiple 3D benchmarks, and that its performance on 2D tasks is comparable to that of the original LLaVA, supporting its effectiveness as a general-purpose model.
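The 3D Patch construction step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a pinhole camera, a per-pixel depth map (the summary does not say where depth comes from), and a toy two-layer MLP with ReLU applied directly to the raw coordinates. The function names `backproject_patch`, `patch_positions`, `two_layer_mlp`, and `make_3d_patch` are hypothetical.

```python
def backproject_patch(u, v, depth, intrinsics, cam_to_world):
    """Lift a patch center (u, v) with known depth to a world-frame point.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    cam_to_world: 4x4 camera-to-world extrinsic matrix (nested lists).
    """
    fx, fy, cx, cy = intrinsics
    # Pixel -> camera frame (pinhole model).
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    cam_point = (x_cam, y_cam, depth, 1.0)
    # Camera frame -> world frame via the homogeneous transform.
    return tuple(
        sum(cam_to_world[r][k] * cam_point[k] for k in range(4))
        for r in range(3)
    )


def patch_positions(depth_map, intrinsics, cam_to_world, patch_size):
    """Return an h x w grid of world positions, one per patch center."""
    rows = len(depth_map) // patch_size
    cols = len(depth_map[0]) // patch_size
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Assumption: the pixel at the patch center stands in for the patch.
            v = i * patch_size + patch_size // 2
            u = j * patch_size + patch_size // 2
            row.append(backproject_patch(u, v, depth_map[v][u],
                                         intrinsics, cam_to_world))
        grid.append(row)
    return grid


def two_layer_mlp(x, w1, b1, w2, b2):
    """Toy two-layer MLP; the ReLU activation is an assumption."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def make_3d_patch(feature, position, mlp_params):
    """Add the embedded 3D position to a patch feature of equal length."""
    embedding = two_layer_mlp(position, *mlp_params)
    return [f + e for f, e in zip(feature, embedding)]
```

In a real model the MLP weights would be learned and the features would come from the CLIP encoder; here both are placeholders chosen so that the element-wise addition is easy to follow.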
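Voxelization pooling, one of the two parameter-free strategies mentioned above, can be sketched as follows. The sketch assumes that patches falling into the same voxel are merged by averaging their features; the summary does not state the merge rule, so treat the averaging, the `voxel_size` parameter, and the function name `voxel_pool` as illustrative assumptions.

```python
import math
from collections import defaultdict


def voxel_pool(positions, features, voxel_size):
    """Group 3D Patches by voxel and average the features in each voxel.

    positions: list of (x, y, z) world coordinates, one per patch.
    features: list of equal-length feature vectors, one per patch.
    Returns a dict mapping integer voxel index -> pooled feature vector.
    """
    buckets = defaultdict(list)
    for pos, feat in zip(positions, features):
        # Integer voxel index along each axis.
        key = tuple(math.floor(c / voxel_size) for c in pos)
        buckets[key].append(feat)

    pooled = {}
    for key, feats in buckets.items():
        dim = len(feats[0])
        # Assumption: features in one voxel are merged by a plain mean.
        pooled[key] = [sum(f[d] for f in feats) / len(feats)
                       for d in range(dim)]
    return pooled
```

Because the number of occupied voxels depends on scene geometry rather than on the number of views, this kind of pooling keeps the token count bounded as more images are added.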
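Farthest point sampling (FPS), the other pooling strategy named above, can be sketched in a few lines. This is the standard greedy FPS algorithm, not code from the paper; the starting index and the choice to sample on patch positions and then keep the features at the selected indices are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are spread far apart.

    points: list of (x, y, z) patch positions.
    Returns the selected indices in the order they were chosen.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_d2 = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point farthest from the current set.
        nxt = max(range(len(points)), key=min_d2.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d2 = squared_distance(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return selected
```

A caller would then keep only the 3D Patches at the returned indices, for example `[patches[i] for i in farthest_point_sampling(positions, k)]`.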

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
