Abstract: 3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between language and 3D vision modalities. We decompose both the language and 3D point cloud input into two separate parts and design a dual-branch attention module to separately model the decomposed inputs while preserving global context in attribute-spatial feature fusion by cross attentions. Our DASANet achieves the highest grounding accuracy of 65.1% on the Nr3D dataset, 1.3% higher than the best competitor. In addition, visualization of the two branches shows that our method is efficient and highly interpretable.

What problem does this paper attempt to address?

The paper addresses the 3D visual grounding task, aiming to improve localization accuracy by aligning language descriptions with object attributes and spatial relationships in 3D scenes. It proposes a new network architecture named DASANet (Dual Attribute-Spatial Relation Alignment Network). The main contributions of this approach include:

1. **Proposing DASANet**: To achieve fine-grained visual-language alignment, the paper designs a dual-branch network, DASANet, which models object attributes and spatial relationships separately and performs interpretable fine-grained alignment in both modalities.
2. **GTAS Training Strategy**: To better separate attribute features and spatial features, the paper introduces a training strategy based on Ground-Truth Attribute Scores (GTAS). This helps improve feature decoupling and fine-grained feature alignment while enhancing the model's interpretability.
3. **Performance**: DASANet achieves the highest localization accuracy (65.1%) on the Nr3D dataset, outperforming the best competitor by 1.3%, and also performs well on the Sr3D dataset, demonstrating its effectiveness in the 3D visual grounding task.

In summary, the paper addresses how to use the object attribute and spatial relationship information in language descriptions to accurately identify target objects in 3D scenes, and validates the proposed method through experiments.
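
The dual-branch idea summarized above can be illustrated with a toy sketch in plain Python. This is not the paper's implementation: the function names (`cross_attend`, `dual_branch_scores`), the feature shapes, and the additive fusion of the two branch scores are assumptions made for illustration only, and the sketch omits the GTAS training strategy and the cross-branch attribute-spatial fusion mentioned in the abstract. It only shows how per-object attribute features and spatial features might each attend to their own part of the decomposed text and yield separately inspectable scores.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys, values):
    """Scaled dot-product cross attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def dual_branch_scores(obj_attr, obj_spat, txt_attr, txt_spat):
    """Score each object in two separate branches, then fuse.

    Hypothetical helper: the additive fusion below is an assumption,
    not something confirmed by the source text.
    """
    # Attribute branch: object attribute features attend to attribute text tokens.
    attr_ctx = cross_attend(obj_attr, txt_attr, txt_attr)
    # Spatial branch: object spatial features attend to spatial-relation text tokens.
    spat_ctx = cross_attend(obj_spat, txt_spat, txt_spat)
    attr_scores = [dot(o, c) for o, c in zip(obj_attr, attr_ctx)]
    spat_scores = [dot(o, c) for o, c in zip(obj_spat, spat_ctx)]
    fused = [a + s for a, s in zip(attr_scores, spat_scores)]
    return attr_scores, spat_scores, fused


# Made-up 2-D features for three candidate objects and two text tokens per branch.
obj_attr = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
obj_spat = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
txt_attr = [[1.0, 0.0], [0.5, 0.5]]
txt_spat = [[1.0, 0.0], [0.8, 0.2]]

attr, spat, fused = dual_branch_scores(obj_attr, obj_spat, txt_attr, txt_spat)
predicted = fused.index(max(fused))
print("attribute scores:", attr)
print("spatial scores:  ", spat)
print("predicted object index:", predicted)
```

Keeping the two score lists separate is what makes this kind of design inspectable: one can see whether a candidate was selected because it matched the described attributes, the described spatial relation, or both.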

Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding