Abstract: Learning to build 3D scene graphs is essential for perceiving the real world in a structured and rich fashion. However, previous 3D scene graph generation methods rely on fully supervised learning and require a large amount of entity-level annotation for objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, 3D-VLAP exploits the strong ability of current large-scale visual-linguistic models to align the semantics of texts and 2D images, together with the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby aligning 3D point clouds with 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. Pseudo labels for objects and relations are then produced for training the 3D-VLAP model by calculating the similarity between the visual embeddings and the textual category embeddings of objects and relations, respectively, both encoded by the visual-linguistic model. Finally, we design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes. Extensive experiments demonstrate that 3D-VLAP achieves results comparable to current advanced fully supervised methods while significantly reducing the burden of data annotation.
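The two alignment steps described in the abstract (3D-to-2D projection with camera parameters, then similarity-based pseudo-labeling) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an undistorted pinhole camera with world-to-camera extrinsics, and the function names `project_point` and `pseudo_label`, as well as the toy two-dimensional embeddings, are hypothetical stand-ins for the outputs of a visual-linguistic model. The abstract does not state which similarity measure or selection rule is used, so cosine similarity with an argmax is an assumption here.

```python
import math


def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (pinhole model, assumed).

    rotation is a 3x3 nested list and translation a length-3 list, together
    forming world-to-camera extrinsics; fx, fy, cx, cy are the intrinsics.
    Returns (u, v, depth), or None if the point lies behind the camera.
    """
    # Transform the point from world coordinates into the camera frame.
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None
    # Perspective division followed by the intrinsic mapping to pixels.
    return (fx * x / z + cx, fy * y / z + cy, z)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pseudo_label(visual_embedding, text_embeddings):
    """Pick the category whose text embedding is most similar (assumed rule).

    text_embeddings maps a category name to its embedding vector.
    Returns (best_category, similarity_score).
    """
    scores = {
        label: cosine_similarity(visual_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy usage: identity extrinsics and hand-made embeddings (illustrative only).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([1.0, 2.0, 4.0], IDENTITY, [0.0, 0.0, 0.0],
                      fx=100.0, fy=100.0, cx=50.0, cy=60.0)
label, score = pseudo_label([1.0, 0.0],
                            {"chair": [0.9, 0.1], "table": [0.1, 0.9]})
```

In practice the visual embedding would come from an image crop around the projected pixels of a 3D instance, and the text embeddings from category names; both encoders are outside the scope of this sketch.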

3D Scene Graph Generation from Point Clouds

Semantic Graph Based Place Recognition for 3D Point Clouds

Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation

Open-Vocabulary Octree-Graph for 3D Scene Understanding

Scene Segmentation and Understanding for Context-Free Point Clouds

Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions

Instance-incremental Scene Graph Generation from Real-world Point Clouds via Normalizing Flows

SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene Graph Generation

Recognition of Indoor Scenes Using 3-D Scene Graphs

Superpoint-guided Semi-supervised Semantic Segmentation of 3D Point Clouds

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

3D Scene Graph Prediction on Point Clouds Using Knowledge Graphs

Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene Understanding

ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling

Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics

SRNet: A 3D Scene Recognition Network Using Static Graph and Dense Semantic Fusion

On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments

Exploring Hierarchical Spatial Layout Cues for 3D Point Cloud based Scene Graph Prediction

Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck

SAM-guided Graph Cut for 3D Instance Segmentation