SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang
2024-12-06
Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We propose to represent 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLMs input formats. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, we exceed weakly supervised methods and rival some fully supervised ones, outperforming previous SOTA by 7.7% on ScanRefer and 7.1% on Nr3D, showcasing its effectiveness.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses how to accurately localize target objects in 3D scenes without any additional 3D-specific training. Specifically, the authors propose SeeGround, which tackles zero-shot open-vocabulary 3D visual grounding (3DVG) by combining 2D vision-language models (VLMs) with 3D spatial descriptions. Traditional methods rely on annotated 3D datasets and predefined object categories, which limits their scalability and adaptability. SeeGround instead uses VLMs trained on large-scale 2D data and represents a 3D scene as a mixture of query-aligned rendered images and spatially rich text descriptions, bridging the gap between 3D data and the 2D-VLM input format.

### Main Problems and Solutions

1. **Limitations of Traditional 3DVG Methods**:
   - **Dependence on Annotated Data**: Existing methods usually require large annotated 3D datasets, which are costly to build and difficult to extend to diverse real-world environments.
   - **Lack of Flexibility**: These methods handle only predefined object categories and cannot deal with open-vocabulary queries.
2. **Innovations of SeeGround**:
   - **Zero-Shot Learning**: By using VLMs trained on large-scale 2D data, SeeGround performs 3D object localization without 3D-specific training data.
   - **Cross-Modal Alignment**: Representing 3D scenes as a combination of 2D rendered images and 3D spatial descriptions enables 2D-VLMs to understand 3D structures and relationships.
   - **Dynamic Viewpoint Selection**: The Perspective Adaptation Module dynamically selects a viewpoint according to the query, capturing the key details and spatial relationships of the target object.
   - **Fusion Alignment Module**: Explicitly associating key objects in the image with their 3D text descriptions reduces localization ambiguity in multi-object scenes and improves efficiency and accuracy.

### Formula Representation

The core formulas of SeeGround are as follows:

- **3D Scene Representation**:
  \[ (I, T) = F(S, Q, \text{OLT}) \]
  where \( S \) is the 3D scene, \( Q \) is the query, \( \text{OLT} \) is the Object Lookup Table, \( I \) is the 2D rendered image, and \( T \) is the text-based spatial description.
- **Depth-Aware Visual Cue**:
  \[ I_m = I \odot \left(1 - \mathbf{1}_{P_{\text{visible}}(o)}\right) + M_o \odot \mathbf{1}_{P_{\text{visible}}(o)} \]
  where \( \mathbf{1}_{P_{\text{visible}}(o)} \) is the visibility indicator of object \( o \), \( \odot \) denotes element-wise multiplication, and \( M_o \) is the visual cue.
- **Object Prediction**:
  \[ \hat{o} = \text{VLM}(Q \mid I_m, T) \]

### Summary

SeeGround achieves zero-shot 3D object localization without additional 3D training data by combining a 2D-VLM with 3D spatial descriptions. According to the abstract, it outperforms the previous state of the art by 7.7% on ScanRefer and 7.1% on Nr3D, and the method is described as robust and accurate in complex scenes.
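
The depth-aware visual cue formula listed under Formula Representation is a per-pixel blend: inside the visible region of object \( o \) the rendered image is replaced by the cue \( M_o \), and everywhere else it is left unchanged. The following is a minimal NumPy sketch of that blend only, not the authors' implementation. The function name `overlay_visual_cue`, the array shapes, and the assumption that the visibility indicator is already available as a boolean per-pixel mask are all illustrative choices; how the mask is derived from projected points and depth is not covered here.

```python
import numpy as np


def overlay_visual_cue(image, visible_mask, cue):
    """Blend I_m = I * (1 - mask) + M_o * mask (hypothetical helper).

    image:        H x W x 3 float array, the rendered image I
    visible_mask: H x W bool array, the indicator of the visible pixels of o
    cue:          H x W x 3 float array, the visual cue M_o
    """
    # Add a trailing channel axis so the mask broadcasts over RGB.
    mask = visible_mask.astype(image.dtype)[..., None]
    return image * (1 - mask) + cue * mask


# Toy data (assumed for illustration): a grey 4x4 image and a red cue.
image = np.full((4, 4, 3), 0.5, dtype=np.float32)
cue = np.tile(np.array([1.0, 0.0, 0.0], dtype=np.float32), (4, 4, 1))

# Pretend the object is visible in the central 2x2 block of pixels.
visible = np.zeros((4, 4), dtype=bool)
visible[1:3, 1:3] = True

marked = overlay_visual_cue(image, visible, cue)
```

In this toy setup, pixels inside the central block take the cue colour while all other pixels keep the original grey value, which matches the two terms of the formula.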