Abstract: As a widely explored multi-modal task, 3D object grounding aims to localize a unique, pre-existing object within a single 3D scene given a natural language description. However, such a strict setting is unnatural, as it is not always possible to know whether a target object exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available, some of which may not contain the described object while others may contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, which simultaneously processes a group of related 3D scenes and allows a flexible number of target objects to exist in each scene. We argue that localizing target objects in each scene individually ignores the rich visual information contained in the other related 3D scenes of the same group, which may lead to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting, which extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism to explicitly exploit the intra-group visual connections. Specifically, based on context-aware spatial-semantic alignment, a language-guided consensus aggregation module aggregates the visual features of target objects in each 3D scene to form a visual consensus representation. This representation is then distributed to and injected into a consensus-modulated feature refinement module that refines the visual features, thus benefiting the subsequent multi-modal reasoning. To validate the effectiveness of the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate that GNL3D achieves state-of-the-art results in both the group-wise setting and the traditional 3D object grounding task.
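The aggregate-then-distribute idea described in the abstract can be illustrated with a small toy sketch. This is not the GNL3D implementation: the function names (`aggregate_consensus`, `distribute_consensus`), the dot-product softmax weighting, the mean over scenes, and the blending factor `alpha` are all assumptions chosen only to make the data flow concrete.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate_consensus(scenes, lang):
    """Hypothetical aggregation step.

    scenes: list of scenes, each a list of object feature vectors.
    lang:   language feature vector with the same dimension.
    Each scene is pooled with language-guided softmax weights
    (an assumption), then the pooled vectors are averaged over the group.
    """
    dim = len(lang)
    pooled_per_scene = []
    for objects in scenes:
        if not objects:  # skip scenes with no object proposals
            continue
        weights = softmax([dot(obj, lang) for obj in objects])
        pooled = [sum(w * obj[d] for w, obj in zip(weights, objects))
                  for d in range(dim)]
        pooled_per_scene.append(pooled)
    if not pooled_per_scene:
        return [0.0] * dim
    n = len(pooled_per_scene)
    return [sum(p[d] for p in pooled_per_scene) / n for d in range(dim)]

def distribute_consensus(scenes, consensus, alpha=0.5):
    """Hypothetical distribution step: blend the group consensus
    into every object feature (simple interpolation, an assumption)."""
    return [[[(1 - alpha) * x + alpha * c for x, c in zip(obj, consensus)]
             for obj in objects]
            for objects in scenes]

# Toy group: two scenes with 2-D object features and one language vector.
scenes = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]
lang = [10.0, 0.0]
consensus = aggregate_consensus(scenes, lang)
refined = distribute_consensus(scenes, consensus, alpha=0.5)
```

In this toy setup the object that matches the language vector dominates the consensus, and the consensus then pulls every object feature in the group toward it. A learned model would replace both steps with trainable attention and refinement layers.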

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Multi-View Domain Adaptive Object Detection on Camera Networks

Grounded 3D-LLM with Referent Tokens

Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding

Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

Language Adaptive Weight Generation for Multi-task Visual Grounding

Cross-Modal Match for Language Conditioned 3D Object Grounding

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Mono3DVG: 3D Visual Grounding in Monocular Images

Advancing 3D Object Grounding Beyond a Single 3D Scene

Monocular 3D Object Detection via Feature Domain Adaptation

CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection

VLDadaptor: Domain Adaptive Object Detection with Vision-Language Model Distillation

GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds

3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds

Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment

Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

Towards CLIP-driven Language-free 3D Visual Grounding Via 2D-3D Relational Enhancement and Consistency

Bi3D: Bi-domain Active Learning for Cross-domain 3D Object Detection