VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
2024-10-18
Abstract: 3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods depending on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6% Acc@0.25 on ScanRefer and 48.0% Acc on Nr3D, without relying on 3D geometry or object priors. Codes are available at https://github.com/OpenRobotLab/VLM-Grounder.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The problem this paper attempts to solve is **zero-shot 3D visual grounding**. Specifically, the researchers aim to accurately locate target objects in 3D scenes from natural language queries and 2D image sequences, without relying on 3D point cloud data or object priors.

### Background and Challenges

Traditional methods mainly rely on supervised learning, training end-to-end models on paired 3D point cloud and language data. However, existing 3D visual grounding datasets are scarce and limited to predefined vocabularies, which makes it difficult to develop general-purpose models suited to the open world. Recently, zero-shot methods based on large language models (LLMs) have been proposed to address the shortage of data. Although effective, these methods use only object-centric information, which limits their ability to handle complex queries such as "find the room with the most sunlight".

### Innovations of VLM-Grounder

To address these problems, the paper proposes **VLM-Grounder**, a new framework based on vision-language models (VLMs) for zero-shot 3D visual grounding. Its main contributions are:

1. **Dynamic Stitching Strategy**: VLMs are limited in how many images they can process (maximum number of images, context length, etc.). To overcome this, the researchers designed a dynamic stitching strategy that stitches multiple images into a single image and selects the optimal layout according to benchmark tests, thereby improving VLM performance.
2. **Grounding and Feedback Mechanism**: The VLM analyzes the user query and explains its reasoning, and automatic feedback is provided to make the result more accurate. When the VLM gives an invalid response, the system retries until a valid target is found or the retry limit is reached.
3. **Multi-view Ensemble Projection**: Rather than estimating the 3D bounding box from a single image, VLM-Grounder uses multi-view images for joint estimation. Different views of the same target object are found through image matching, and these views are combined to jointly estimate the 3D bounding box. Morphological operations are also used to better handle inaccurate depth.

### Experimental Results

VLM-Grounder outperforms previous zero-shot methods on the ScanRefer and Nr3D datasets. On ScanRefer it achieves an Acc@0.25 of 51.6%, and on Nr3D it achieves an overall accuracy of 48.0%.

### Summary

By introducing a VLM-based zero-shot 3D visual grounding framework, VLM-Grounder removes the reliance of existing methods on 3D point cloud data and object priors, and it performs better on complex queries.
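
For readers unfamiliar with the Acc@0.25 figure quoted above: it is the fraction of queries whose predicted 3D bounding box overlaps the ground-truth box with an intersection-over-union (IoU) that reaches 0.25. The following is a minimal sketch of that metric, not code from the paper. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names `iou_3d` and `acc_at_iou` are hypothetical, and counting an IoU exactly equal to the threshold as a hit is also an assumption.

```python
from typing import Sequence

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])          # start of the overlap on this axis
        hi = min(a[axis + 3], b[axis + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                      # boxes are disjoint on this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box],
               threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    reaches `threshold` (ties counted as hits; an assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

As a usage example, `acc_at_iou([(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)], [(1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)])` pairs one prediction that overlaps its ground truth with an IoU of 1/3 and one that does not overlap at all, so half of the predictions count as correct at the 0.25 threshold.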