Abstract: 3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the attention maps accurately capture the layout described in the text. Benefiting from the proposed dual-branch design, the queries are allowed to focus on points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch respectively. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that have valuable layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
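
As a rough illustration of the dual-branch idea in the abstract, the sketch below routes text tokens to two branches and computes separate attention weights for each. It is not the authors' implementation: the function names, the hand-labeled `is_target` split, the toy token features, and the use of a single shared query are all assumptions made for the example.

```python
import math


def attention_weights(query, keys, keep):
    """Scaled dot-product attention of one query over text tokens.

    Tokens whose `keep` flag is False receive exactly zero weight.
    """
    if not any(keep):
        raise ValueError("at least one token must be kept")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if use else float("-inf")
        for key, use in zip(keys, keep)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def dual_branch_attention(query, token_feats, is_target):
    """Return (target-branch weights, surrounding-branch weights).

    Assumption: each token is assigned to exactly one branch.
    """
    target_w = attention_weights(query, token_feats, is_target)
    surround_w = attention_weights(query, token_feats, [not t for t in is_target])
    return target_w, surround_w


# Hypothetical description; the target/surrounding split is labeled by hand.
tokens = ["the", "brown", "chair", "next", "to", "the", "table"]
is_target = [False, True, True, False, False, False, False]

# Toy token features and query, chosen only so the example runs.
token_feats = [[0.1 * i, 0.5, -0.2 * i] for i in range(len(tokens))]
query = [1.0, 0.0, 1.0]

target_w, surround_w = dual_branch_attention(query, token_feats, is_target)
```

In this sketch the target branch can only attend to "brown" and "chair", while the surrounding branch only sees the remaining tokens, which mirrors the separation of objectives the abstract describes.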

What problem does this paper attempt to address?

This paper addresses two main challenges in 3D visual grounding:

1. **Entanglement of target object features and spatial layout features**: In previous methods, the attribute features of the target object (such as shape and color) are intertwined with the spatial layout features. This makes it hard for the model to separate the two kinds of attention, which reduces localization accuracy. For example, when the model attends to the target object and its surroundings at the same time, its attention may spread to objects of the same category at other positions.
2. **Text information fails to effectively guide the visual cross-attention module**: Existing methods do not fully use text information to guide the attention of the queries when processing visual features. As a result, the queries only roughly learn the features of neighboring points, so the model attends to redundant or irrelevant spatial layout information and its attention is dispersed further.

To solve these problems, the authors propose a new framework, PD-APE (Parallel Decoding Framework with Adaptive Position Encoding). The framework improves 3D visual grounding in two ways:

- **Two-branch decoder**: A parallel decoder with two branches. One branch decodes the features of the target object, and the other perceives the layout of the surrounding environment. This lets the model attend to the target object and to the surrounding environment separately, avoiding dispersed attention.
- **Adaptive position encoding**: Each branch has its own adaptive position encoding. For the target object branch, the position encoding is based on the relative position between the seed points and the predicted 3D box. For the surrounding environment branch, the attention map is additionally guided by the confidence between visual and text features, so that it better captures the spatial layout described in the text.

With these improvements, PD-APE outperforms existing state-of-the-art methods on two widely used 3D visual grounding datasets, ScanRefer and Nr3D.
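
The two adaptive position encodings can be sketched in a few lines, under stated assumptions. The summary does not give the exact formulas, so everything here is hypothetical: the box-normalized offset, the sinusoidal encoding, and the log-confidence bias are common choices picked for illustration, not the formulation used in PD-APE.

```python
import math


def relative_offset(point, box_center, box_size):
    """Offset of a seed point from a box center, scaled by the box half-extents.

    A value in [-1, 1] on every axis means the point lies inside the box.
    """
    return [(p - c) / (s / 2.0) for p, c, s in zip(point, box_center, box_size)]


def sinusoidal_encoding(values, num_freqs=2):
    """Encode each scalar with sin/cos pairs at doubling frequencies."""
    encoded = []
    for v in values:
        for k in range(num_freqs):
            freq = (2.0 ** k) * math.pi
            encoded.append(math.sin(freq * v))
            encoded.append(math.cos(freq * v))
    return encoded


def confidence_guided_attention(logits, confidences, eps=1e-6):
    """Softmax over attention logits after adding log-confidence as a bias.

    Assumption: `confidences` are visual-text matching scores in (0, 1],
    one per seed point. Low-confidence points are down-weighted.
    """
    biased = [l + math.log(max(c, eps)) for l, c in zip(logits, confidences)]
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Target branch: encode a seed point relative to a predicted box (toy values).
offset = relative_offset([1.0, 2.0, 0.5], [1.0, 1.0, 0.5], [2.0, 4.0, 1.0])
target_pos_enc = sinusoidal_encoding(offset)

# Surrounding branch: equal logits, so the weights follow the confidences.
surround_weights = confidence_guided_attention([0.0, 0.0], [0.9, 0.1])
```

Scaling the offset by the box size makes the encoding depend on where a point sits relative to the box rather than on absolute scene coordinates, which is one plausible reading of "relative position between the seed points and the predicted 3D box".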

PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding
