Symbolic Graph Inference for Compound Scene Understanding

FNU Aryan, Simon Stepputtis, Sarthak Bhagat, Joseph Campbell, Kwonjoon Lee, Hossein Nourkhiz Mahjoub, Katia Sycara
2024-10-30
Abstract: Scene understanding is a fundamental capability needed in many domains, ranging from question-answering to robotics. Unlike recent end-to-end approaches that must explicitly learn varying compositions of the same scene, our method reasons over their constituent objects and analyzes their arrangement to infer a scene's meaning. We propose a novel approach that reasons over a scene's scene- and knowledge-graph, capturing spatial information while being able to utilize general domain knowledge in a joint graph search. Empirically, we demonstrate the feasibility of our method on the ADE20K dataset and compare it to current scene understanding approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a key challenge in **compound scene understanding**. Current scene-understanding methods often treat a scene as an indivisible whole, ignoring its components and the relationships between them. For example, a kitchen may lack certain elements (such as an oven) but is still a kitchen because it contains other characteristic components (such as a stove, a sink, and a refrigerator). Conversely, a port without water should not be recognized as a port, even if ships are present.

To understand and classify such scenes more accurately, the authors propose a method that captures the spatial information of the objects in a scene and how they combine, using both a **Scene Graph (SG)** and a **Knowledge Graph (KG)**. The method dynamically connects the spatial information in the scene graph to the domain knowledge in the knowledge graph and reasons over both in a joint graph search. This lets the model account for the constituent elements of a scene and their mutual relationships when inferring what the scene is.

### Main contributions

1. **Dual-Graph Search Approach**: Combine the scene graph and the knowledge graph for joint reasoning about complex scenes.
2. **Dynamic exploration mechanism**: Automatically determine when enough information has been considered, avoiding unnecessary computation.
3. **Experimental verification**: Demonstrate the feasibility of the method on the ADE20K dataset and compare it with symbolic and neural-network methods.

### Method overview

1. **Generate scene graph and knowledge graph**: Detect objects in the input image and generate a scene graph, and initialize a knowledge graph at the same time.
2. **Merge graphs**: Combine the spatial information in the scene graph with the domain knowledge in the knowledge graph to form a merged graph.
3. **Joint graph search**: Search and predict on the merged graph using the propagation network, the importance network, and the task classifier.

### Experimental results

- On the object-level and image-level benchmarks, the method performs strongly; in particular, its accuracy on the complete test set is close to human level.
- Compared with baseline models such as GPT4-Vision, the method shows higher accuracy on complex scene-understanding tasks.

In conclusion, the paper aims to provide a more effective way to understand and classify complex scenes by combining the scene graph and the knowledge graph, addressing the fact that existing methods ignore a scene's components and their relationships.
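
To make the merge and search steps from the method overview more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the paper relies on a learned propagation network, importance network, and task classifier, whereas this sketch swaps in a hand-written vote count with a simple lead-margin stopping rule. The triples, the `found_in` relation, the label set, and the names `merge_graphs`, `classify_scene`, and `margin` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical scene graph: spatial relations between detected objects.
scene_graph = [
    ("stove", "next_to", "sink"),
    ("refrigerator", "next_to", "stove"),
]

# Hypothetical knowledge graph: general domain knowledge as triples.
knowledge_graph = [
    ("stove", "found_in", "kitchen"),
    ("sink", "found_in", "kitchen"),
    ("sink", "found_in", "bathroom"),
    ("refrigerator", "found_in", "kitchen"),
    ("oven", "found_in", "kitchen"),  # never detected, yet "kitchen" can still win
]

SCENE_LABELS = {"kitchen", "bathroom"}  # assumed label set


def merge_graphs(scene_triples, knowledge_triples):
    """Build one undirected adjacency map; nodes sharing a name are merged."""
    adjacency = defaultdict(set)
    for head, relation, tail in list(scene_triples) + list(knowledge_triples):
        adjacency[head].add((relation, tail))
        adjacency[tail].add((relation, head))
    return adjacency


def classify_scene(adjacency, detected_objects, margin=2):
    """Expand one detected object at a time and stop early once the
    leading scene label is ahead of the runner-up by at least `margin` votes."""
    votes = defaultdict(int)
    visited = 0
    for obj in detected_objects:
        visited += 1
        for relation, neighbor in adjacency[obj]:
            if relation == "found_in" and neighbor in SCENE_LABELS:
                votes[neighbor] += 1
        ranked = sorted(votes.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0
        if ranked and ranked[0] - runner_up >= margin:
            break  # enough evidence gathered; skip the remaining objects
    best = max(votes, key=votes.get) if votes else None
    return best, visited


merged = merge_graphs(scene_graph, knowledge_graph)
predicted_label, objects_visited = classify_scene(
    merged, ["stove", "refrigerator", "sink"]
)
print(predicted_label, objects_visited)
```

The early `break` is a crude stand-in for the dynamic exploration mechanism: the search stops expanding objects once the evidence for one label is sufficiently ahead, so not every detected object has to be visited.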