Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We propose to represent 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLM input formats. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, we exceed weakly supervised methods and rival some fully supervised ones, outperforming previous SOTA by 7.7% on ScanRefer and 7.1% on Nr3D, showcasing its effectiveness.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to accurately localize target objects in 3D scenes without additional training on 3D data. Specifically, the authors propose a new method named SeeGround, which addresses zero-shot open-vocabulary 3D visual grounding (3DVG) by combining 2D vision-language models (VLMs) with 3D spatial descriptions. Traditional methods rely on annotated 3D datasets and predefined object categories, which limit their scalability and adaptability. SeeGround instead uses VLMs trained on large-scale 2D data and represents 3D scenes as a mixture of query-aligned rendered images and spatially rich text descriptions, thus bridging the gap between 3D data and the 2D-VLM input format.

### Main Problems and Solutions

1. **Limitations of Traditional 3DVG Methods**:
   - **Dependence on Annotated Data**: Existing methods usually require large annotated 3D datasets, which are costly to build and difficult to extend to diverse real-world environments.
   - **Lack of Flexibility**: These methods can only handle predefined object categories and cannot deal with open-vocabulary queries.
2. **Innovations of SeeGround**:
   - **Zero-Shot Learning**: By using VLMs trained on large-scale 2D data, SeeGround can perform 3D object localization without 3D-specific training data.
   - **Cross-Modal Alignment**: Representing 3D scenes as a combination of 2D rendered images and 3D spatial descriptions enables 2D-VLMs to understand 3D structures and relationships.
   - **Dynamic Viewpoint Selection**: The Perspective Adaptation Module dynamically selects the best viewpoint according to the query, capturing the key details and spatial relationships of the target object.
   - **Fusion Alignment Module**: By explicitly associating key objects in the image with their 3D text descriptions, this module reduces localization ambiguity in multi-object scenes and improves efficiency and accuracy.

### Formula Representation

The core formulas of SeeGround are as follows:

- **3D Scene Representation**:
  \[ (I, T) = F(S, Q, \mathrm{OLT}) \]
  where \( S \) is the 3D scene, \( Q \) is the query, \( \mathrm{OLT} \) is the Object Lookup Table, \( I \) is the 2D rendered image, and \( T \) is the text-based spatial description.
- **Depth-Aware Visual Cue**:
  \[ I_m = I \odot \left(1 - \mathbb{1}_{P_{\text{visible}}(o)}\right) + M_o \odot \mathbb{1}_{P_{\text{visible}}(o)} \]
  where \( \mathbb{1}_{P_{\text{visible}}(o)} \) is the visibility indicator of object \( o \), \( \odot \) denotes element-wise multiplication, and \( M_o \) is the visual cue.
- **Object Prediction**:
  \[ \hat{o} = \mathrm{VLM}(Q \mid I_m, T) \]
  where \( \hat{o} \) is the predicted target object.

### Summary

SeeGround achieves zero-shot 3D object localization without additional 3D training data by combining a 2D-VLM with 3D spatial descriptions. The method performs well on the ScanRefer and Nr3D benchmarks and is described as particularly robust and accurate in complex scenes.
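To make the depth-aware visual cue formula concrete, the following is a minimal NumPy sketch of the per-pixel compositing \( I_m = I \odot (1 - \mathbb{1}) + M_o \odot \mathbb{1} \). It is an illustration only, not the authors' implementation: the function name `apply_visual_cue`, the array shapes, and the toy inputs are assumptions, and the visibility mask is taken as given rather than computed from depth.

```python
import numpy as np


def apply_visual_cue(image, visible_mask, cue):
    """Composite a marker layer onto a rendered image (hypothetical helper).

    image:        (H, W, 3) float array, the rendered image I
    visible_mask: (H, W) bool array, indicator of the visible pixels of object o
    cue:          (H, W, 3) float array, the marker layer M_o
    Returns I_m = I * (1 - indicator) + M_o * indicator.
    """
    # Add a trailing channel axis so the mask broadcasts over RGB.
    indicator = visible_mask.astype(image.dtype)[..., None]
    return image * (1.0 - indicator) + cue * indicator


# Toy inputs (assumed for illustration): a black image and a white marker layer.
image = np.zeros((4, 4, 3), dtype=np.float32)
cue = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # the object is visible in the central 2x2 block

marked = apply_visual_cue(image, mask, cue)
```

Pixels outside the mask keep the original image values, while pixels inside it are replaced by the marker, so occluded parts of an object would receive no marker as long as the mask only covers visible pixels.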

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Mono3DVG: 3D Visual Grounding in Monocular Images

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Data-Efficient 3D Visual Grounding via Order-Aware Referring

Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

Exploiting Contextual Objects and Relations for 3D Visual Grounding

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Advancing 3D Object Grounding Beyond a Single 3D Scene

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

OV-VG: A benchmark for open-vocabulary visual grounding

Zero-Shot Video Grounding With Pseudo Query Lookup and Verification

3D Visual Grounding-Audio: 3D scene object detection based on audio

Grounded 3D-LLM with Referent Tokens

Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities

Task-oriented Sequential Grounding in 3D Scenes