Abstract: 3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods depending on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6% Acc@0.25 on ScanRefer and 48.0% Acc on Nr3D, without relying on 3D geometry or object priors. Codes are available at https://github.com/OpenRobotLab/VLM-Grounder.
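The abstract reports Acc@0.25 without defining it. On ScanRefer-style benchmarks this is commonly read as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25. The sketch below illustrates that reading only; it is not the authors' evaluation code. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, and the names `iou_3d` and `acc_at_iou` are hypothetical. Whether a box at exactly the threshold counts as a hit is also an assumption.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            # No overlap along this axis means no overlap at all.
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def acc_at_iou(
    predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.25,
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (assumed to include the boundary)."""
    if not predictions:
        return 0.0
    hits = sum(
        iou_3d(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```

Under this reading, a prediction can be noticeably offset from the ground truth and still count as correct, which is why Acc@0.25 is usually treated as the looser of the thresholds reported on such benchmarks.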

What problem does this paper attempt to address?

This paper addresses **zero-shot 3D visual grounding**: locating a target object in a 3D scene from a natural-language query and a sequence of 2D images, without relying on 3D point cloud data or prior knowledge of objects.

### Background and Challenges

Traditional methods rely on supervised learning, training end-to-end models on paired 3D point clouds and language data. However, existing visual grounding datasets are scarce and limited to predefined vocabularies, which makes it difficult to develop general-purpose models for the open world. Recently, zero-shot methods based on large language models (LLMs) have been proposed to address the shortage of data. Although effective, these methods use only object-centric information, which limits their ability to handle complex queries such as "find the room with the most sunlight".

### Innovations of VLM-Grounder

To address these problems, the paper proposes **VLM-Grounder**, a new framework based on vision-language models (VLMs) for zero-shot 3D visual grounding. Its main contributions are:

1. **Dynamic Stitching Strategy**: To work around the limits VLMs face when processing many images (such as the maximum number of images and the context length), the researchers designed a dynamic stitching strategy that combines multiple images into a single image and selects the layout that performs best in benchmark tests, improving VLM performance.
2. **Grounding and Feedback Mechanism**: The VLM analyzes the user query and explains its reasoning, and automatic feedback is provided to make the result more accurate. When the VLM gives an invalid response, the system retries until a valid target is found or the retry limit is reached.
3. **Multi-View Ensemble Projection**: Instead of estimating the 3D bounding box from a single image, VLM-Grounder estimates it jointly from multiple views. Different views of the same target object are found through image matching and then combined to estimate the 3D bounding box. Morphological operations are also used to cope with inaccurate depth.

### Experimental Results

VLM-Grounder significantly outperforms previous zero-shot methods on the ScanRefer and Nr3D datasets. On ScanRefer it achieves an Acc@0.25 of 51.6%, and on Nr3D it achieves an overall accuracy of 48.0%.

### Summary

By introducing a VLM-based zero-shot 3D visual grounding framework, VLM-Grounder removes the dependence of existing methods on 3D point cloud data and prior knowledge of objects, and demonstrates superior performance on complex queries.
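The retry behavior described in the grounding and feedback contribution can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_vlm` is a hypothetical callable standing in for a real VLM call, the validity check (the answer must be one of the candidate image IDs) is an assumed simplification, and the feedback message wording is invented for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical VLM interface: takes the query and the feedback from the
# previous attempt (None on the first call) and returns an image ID or None.
AskVLM = Callable[[str, Optional[str]], Optional[int]]


def ground_with_feedback(
    query: str,
    candidate_ids: Sequence[int],
    ask_vlm: AskVLM,
    max_retries: int = 3,
) -> Optional[int]:
    """Ask the VLM for a target image, retrying with feedback on invalid answers.

    Returns the selected image ID, or None if the retry limit is reached.
    """
    valid_ids = set(candidate_ids)
    feedback: Optional[str] = None
    # One initial attempt plus up to `max_retries` retries.
    for _ in range(max_retries + 1):
        answer = ask_vlm(query, feedback)
        if answer in valid_ids:
            return answer
        # Assumed feedback format: tell the model why the answer was rejected.
        feedback = (
            f"Response {answer!r} is not a valid image ID; "
            f"choose one of {sorted(valid_ids)}."
        )
    return None
```

Passing the rejection reason back into the next call is what separates this from a blind retry: the model sees why its previous answer was refused. A real system would likely check more than ID validity, for example whether the target object is actually visible in the chosen image.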

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

Learning Visual Grounding from Generative Vision and Language Model

Grounded 3D-LLM with Referent Tokens

Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM

Learning to Ground VLMs without Forgetting

Zero-shot detection of buildings in mobile LiDAR using Language Vision Model

Advancing 3D Object Grounding Beyond a Single 3D Scene

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Zero-Shot Video Grounding With Pseudo Query Lookup and Verification