Abstract: This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments. Such an extension is challenging; the hierarchy of concepts that describe an outdoor environment is more complex than for indoors, and manually defining such a hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time, thus reducing the need for labelled data, providing better predictions, and even allowing the prediction of concepts unseen at training time. We test our approach on a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of the 3D scene graph generation with sparsely annotated data.

What problem does this paper attempt to address?

This paper addresses the problem of constructing 3D scene graphs in arbitrary indoor and outdoor environments. It identifies three main challenges:

1. **Semantic Differences Between Indoor and Outdoor Environments**: The hierarchy of concepts for indoor environments (objects, rooms, floors, buildings) is relatively clear, but the concepts needed to describe varied outdoor scenes are less intuitive. Manually defining these label sets for each application is therefore impractical.
2. **Lack of Training Datasets**: Mature training datasets exist for indoor scene graph generation, but datasets for building semantically rich 3D scene graphs of outdoor scenes are almost non-existent. For example, existing work uses annotations from OpenStreetMap (OSM), but OSM only provides a few annotation categories such as roads, highways, and buildings, and excludes smaller objects. This lack of data makes it difficult to apply existing Graph Neural Network (GNN)-based methods, which group nodes (such as objects) into higher-level concepts (such as rooms) to construct hierarchical representations.
3. **Reliability of Learning-Based Methods**: GNN-based methods may produce erroneous predictions when training data is scarce or when they are tested outside the training domain. A method is therefore needed that uses common-sense knowledge to constrain GNN predictions, improving generalization and accuracy across different types of scenes.

To address these challenges, the paper proposes a neuro-symbolic approach to 3D scene graph construction in arbitrary environments, built on two main extensions:

1. **Constructing Spatial Ontology**: The paper develops methods to construct a spatial ontology describing concepts and relationships relevant to indoor and outdoor robotic operations. Specifically, large language models (LLMs) are used to automatically construct this ontology, significantly reducing the required manual effort.
2. **Utilizing Logic Tensor Networks**: The paper employs Logic Tensor Networks (LTNs) to incorporate logical rules or axioms (e.g., "a beach contains sand") into 3D scene graph construction. These rules provide additional supervision signals during training, reducing the need for annotated data, improving prediction quality, and even allowing the prediction of concepts not seen during training.

The paper tests the method on multiple datasets, including indoor, rural, and coastal environments, and shows that it significantly improves the quality of 3D scene graph generation when annotations are sparse.
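
To make the idea of an axiom acting as a supervision signal more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation and does not use any LTN library. The toy `ONTOLOGY`, the function names, and the choice of fuzzy operators (Reichenbach implication, p-mean-error aggregation for "for all") are assumptions made for illustration. It scores how well predicted place labels agree with "contains" relations such as "a beach contains sand"; one minus that score could serve as an extra loss term that needs no place labels.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology (assumption): place concept -> objects it may contain.
# In the paper's setting, such relations would come from an LLM-built ontology.
ONTOLOGY: Dict[str, Set[str]] = {
    "beach": {"sand", "water"},
    "kitchen": {"stove", "sink"},
}


def implies(a: float, b: float) -> float:
    """Reichenbach fuzzy implication: truth of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b


def forall(truths: List[float], p: float = 2.0) -> float:
    """Fuzzy 'for all' as a p-mean-error aggregator; empty input is vacuously true."""
    if not truths:
        return 1.0
    mean_err = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_err ** (1.0 / p)


def axiom_satisfaction(
    edges: List[Tuple[str, int]],
    place_probs: Dict[int, Dict[str, float]],
) -> float:
    """Satisfaction of: 'if object o lies in place node y, then y has a concept
    that contains o according to the ontology'.

    edges:       (object_label, place_node_id) pairs from the scene graph.
    place_probs: predicted concept probabilities for each place node.
    """
    truths = []
    for obj_label, place_id in edges:
        allowed = [c for c, objs in ONTOLOGY.items() if obj_label in objs]
        # Place concepts are treated as mutually exclusive, so probabilities add.
        consequent = min(1.0, sum(place_probs[place_id].get(c, 0.0) for c in allowed))
        # The object label is taken as observed, so the antecedent has truth 1.
        truths.append(implies(1.0, consequent))
    return forall(truths)


if __name__ == "__main__":
    edges = [("sand", 0)]
    consistent = {0: {"beach": 0.9, "kitchen": 0.1}}
    inconsistent = {0: {"beach": 0.1, "kitchen": 0.9}}
    # A prediction consistent with the ontology scores higher than one that is not.
    print(axiom_satisfaction(edges, consistent), axiom_satisfaction(edges, inconsistent))
```

In an actual training loop the probabilities would be differentiable tensors produced by the GNN, so that `1 - axiom_satisfaction(...)` could be minimized alongside the usual supervised loss; plain floats are used here only to keep the sketch self-contained.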

Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs

OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor Environments

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Leveraging Large Language Models for Robot 3D Scene Understanding

Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions

Reasoning about the Unseen for Efficient Outdoor Object Navigation

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization

Indoor Scene Understanding with Geometric and Semantic Contexts

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Open-Vocabulary Octree-Graph for 3D Scene Understanding

SceneGPT: A Language Model for 3D Scene Understanding

A Bottom-up Framework for Construction of Structured Semantic 3D Scene Graph

Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments

Using Language to Generate State Abstractions for Long-Range Planning in Outdoor Environments

Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding

Knowledge Graph Construction with Structure and Parameter Learning for Indoor Scene Design

Scene understanding using natural language description based on 3D semantic graph map

3D Scene Graph Generation from Point Clouds