Abstract: The recently proposed task of fine-grained 3D visual grounding is both essential and challenging: its goal is to identify the 3D object referred to by a natural language sentence among distracting objects of the same category. Existing works usually adopt dynamic graph networks to model intra- and inter-modal interactions indirectly, which makes it difficult for the model to distinguish the referred object from distractors because the visual and linguistic contents are represented monolithically. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose TransRefer3D, a network that extracts entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Using a co-attention operation, the EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that the proposed model outperforms existing approaches by up to 10.6% and sets a new state of the art. To the best of our knowledge, this is the first work to investigate the Transformer architecture for the fine-grained 3D visual grounding task.
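The entity-level and relation-level matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain scaled dot-product cross-attention as a stand-in for the co-attention operation, random arrays as hypothetical object and word features, and feature differences as an assumed form of pair-wise relation features. All names (`cross_attention`, `visual_entities`, `word_features`, `pair_relations`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_objects, n_words, dim = 5, 7, 16
visual_entities = rng.normal(size=(n_objects, dim))  # hypothetical per-object features
word_features = rng.normal(size=(n_words, dim))      # hypothetical per-token features

# Entity-level matching: every object attends over the words of the sentence.
entity_context, entity_weights = cross_attention(
    visual_entities, word_features, word_features
)

# Relation-level matching: pair-wise relation features, assumed here to be
# simple feature differences between object i and object j, attend over words.
pair_relations = (
    visual_entities[:, None, :] - visual_entities[None, :, :]
).reshape(-1, dim)
relation_context, relation_weights = cross_attention(
    pair_relations, word_features, word_features
)
```

In this sketch `entity_context` holds one language-conditioned vector per object and `relation_context` one per ordered object pair; stacking several such steps would loosely correspond to the hierarchical context modeling the abstract attributes to the stacked ERCBs.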

Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation

Visual Translation Embedding Network for Visual Relation Detection

Visual relationship detection with a deep convolutional relationship network

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

RelTR: Relation Transformer for Scene Graph Generation

VrR-VG: Refocusing Visually-Relevant Relationships

Obj-GloVe: Scene-Based Contextual Object Embedding

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction

Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Visual Relations Augmented Cross-modal Retrieval

Rethinking Visual Relationships for High-level Image Understanding

Complex Relation Embedding for Scene Graph Generation

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Localize, Assemble, And Predicate: Contextual Object Proposal Embedding For Visual Relation Detection

Visual Relationship Detection With Image Position and Feature Information Embedding and Fusion

Relation Transformer Network

Towards Flexible Visual Relationship Segmentation

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners