Abstract: Scene understanding is a fundamental capability needed in many domains, ranging from question-answering to robotics. Unlike recent end-to-end approaches that must explicitly learn varying compositions of the same scene, our method reasons over their constituent objects and analyzes their arrangement to infer a scene's meaning. We propose a novel approach that reasons over a scene's scene- and knowledge-graph, capturing spatial information while being able to utilize general domain knowledge in a joint graph search. Empirically, we demonstrate the feasibility of our method on the ADE20K dataset and compare it to current scene understanding approaches.

What problem does this paper attempt to address?

This paper addresses a key challenge in **compound scene understanding**. Current scene-understanding methods often treat a scene as an indivisible whole, ignoring its components and the relationships between them. For example, a kitchen that lacks an oven is still a kitchen if it contains other characteristic components such as a stove, a sink, and a refrigerator. Conversely, a port without water should not be recognized as a port, even if ships are present.

To classify such scenes more accurately, the authors propose a method that combines a **Scene Graph (SG)** with a **Knowledge Graph (KG)** to capture both the spatial information of the objects in a scene and how those objects compose into a scene. The method dynamically links the spatial information in the scene graph to the domain knowledge in the knowledge graph and reasons over both in a joint graph search. The model can therefore take a scene's constituent elements and their mutual relationships into account when deciding what the scene is.

### Main contributions

1. **Dual-graph search approach**: Combines the scene graph and the knowledge graph for joint reasoning about complex scenes.
2. **Dynamic exploration mechanism**: Automatically determines when enough information has been considered, avoiding unnecessary computation.
3. **Experimental verification**: Demonstrates the feasibility of the method on the ADE20K dataset and compares it with symbolic and neural-network methods.

### Method overview

1. **Generate scene graph and knowledge graph**: Detect objects in the input image and generate a scene graph; initialize a knowledge graph at the same time.
2. **Merge graphs**: Combine the spatial information in the scene graph with the domain knowledge in the knowledge graph to form a merged graph.
3. **Joint graph search**: Search and predict on the merged graph using the propagation network, the importance network, and the task classifier.

### Experimental results

- In object-level and image-level benchmarks, the method performs strongly; in particular, its accuracy on the complete test set is close to human level.
- Compared with baseline models such as GPT4-Vision, the method shows higher accuracy on complex scene-understanding tasks.

In conclusion, the paper aims to provide a more effective way to understand and classify complex scenes by combining the scene graph and the knowledge graph, addressing the tendency of existing methods to ignore a scene's components and their relationships.
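The three steps in the method overview (build the graphs, merge them, search the merged graph) can be illustrated with a small Python sketch. This is a toy approximation, not the authors' implementation: the knowledge-graph edges, the weights, the `REQUIRED` constraint table, `SPATIAL_WEIGHT`, and all function names are hypothetical, and the learned propagation network, importance network, and task classifier are replaced here by hand-set weights, a fixed threshold, and an argmax.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph: (concept, scene, weight). Weights are
# hand-set for illustration; in the paper such scores would be learned.
KG_EDGES = [
    ("stove", "kitchen", 0.9),
    ("sink", "kitchen", 0.6),
    ("refrigerator", "kitchen", 0.8),
    ("sink", "bathroom", 0.6),
    ("ship", "harbor", 0.7),
    ("water", "harbor", 0.8),
]
SCENES = {"kitchen", "bathroom", "harbor"}
REQUIRED = {"harbor": {"water"}}  # assumed hard constraint: no water, no harbor
SPATIAL_WEIGHT = 0.5              # assumed weight for scene-graph relations


def merge_graphs(objects, spatial_edges, kg_edges=KG_EDGES):
    """Merge a scene graph and a knowledge graph into one weighted adjacency map.

    objects: {instance_id: concept_label}
    spatial_edges: iterable of (instance_id, relation, instance_id)
    """
    graph = defaultdict(list)
    for src, _relation, dst in spatial_edges:   # spatial edges between instances
        graph[src].append((dst, SPATIAL_WEIGHT))
    for instance, label in objects.items():     # link each instance to its concept
        graph[instance].append((label, 1.0))
    for concept, scene, weight in kg_edges:     # domain knowledge
        graph[concept].append((scene, weight))
    return graph


def joint_search(graph, seeds, min_importance=0.1, max_hops=5):
    """Propagate activation hop by hop, skipping messages below min_importance."""
    total = defaultdict(float)
    frontier = {seed: 1.0 for seed in seeds}
    for node, value in frontier.items():
        total[node] += value
    for _ in range(max_hops):
        incoming = defaultdict(float)
        for node, value in frontier.items():
            if value < min_importance:          # stand-in for the importance network
                continue
            for neighbor, weight in graph.get(node, []):
                incoming[neighbor] += value * weight
        if not incoming:                        # nothing left worth exploring
            break
        for node, value in incoming.items():
            total[node] += value
        frontier = incoming
    return total


def classify(total):
    """Pick the best-scoring scene whose required concepts were all activated."""
    candidates = {
        scene: total[scene]
        for scene in SCENES
        if total.get(scene, 0.0) > 0.0
        and all(total.get(req, 0.0) > 0.0 for req in REQUIRED.get(scene, ()))
    }
    return max(candidates, key=candidates.get) if candidates else None


def predict(objects, spatial_edges=()):
    """Run the full toy pipeline: merge, search, classify."""
    graph = merge_graphs(objects, spatial_edges)
    return classify(joint_search(graph, seeds=objects))


kitchen = {"o1": "stove", "o2": "sink", "o3": "refrigerator"}  # no oven detected
print(predict(kitchen, [("o1", "next_to", "o2")]))
print(predict({"o1": "ship", "o2": "ship"}))                   # ships, but no water
```

Under these assumed weights, the first call still selects `kitchen` even though no oven is present, while the second returns `None` because the assumed `REQUIRED` rule rejects `harbor` when no water node is activated. The early exit from the loop when no messages remain is a rough analogue of the dynamic exploration mechanism.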

Symbolic Graph Inference for Compound Scene Understanding
