Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar, Camillo J. Taylor
2024-07-16
Abstract: This work introduces an enhanced approach to generating scene graphs by incorporating both a relationship hierarchy and commonsense knowledge. Specifically, we begin by proposing a hierarchical relation head that exploits an informative hierarchical structure. It jointly predicts the relation super-category between object pairs in an image, along with detailed relations under each super-category. Following this, we implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system, removing nonsensical predicates even with a small language-only model. Extensive experiments on Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms. The results show significant improvements with an extensive set of reasonable predictions beyond dataset annotations. Code is available at https://github.com/bowen-upenn/scene_graph_commonsense.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Improving the accuracy and plausibility of scene graph generation**: Existing scene graph generation methods have made progress in identifying objects in images and the relationships between them, but they still produce implausible relationship predictions. Such predictions can receive high confidence even though they are unlikely to occur in the real world. For example, a model may assign high probability to "the rabbit jumps onto the plate", which is not a reasonable relationship. The paper therefore introduces a Hierarchical Relation Head and a Commonsense Validation Pipeline to improve both accuracy and plausibility.
2. **Exploiting the relationship hierarchy and commonsense knowledge**: The Hierarchical Relation Head uses the natural hierarchical structure among relationship categories to jointly predict the relation super-category between each object pair and the detailed relations under each super-category. The Commonsense Validation Pipeline uses foundation models (such as large language models or vision-language models) to critique the output of the scene graph generation system and remove predicates that violate common sense; it remains effective even with a small language-only model.

Through these methods, the paper aims to improve the performance of existing scene graph generation algorithms.
In particular, experiments on the Visual Genome and OpenImage V6 datasets show that the proposed modules can be integrated into existing scene graph generation algorithms as plug-and-play enhancements, significantly improving the plausibility and accuracy of the predictions.
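
The two modules described above can be illustrated with a minimal sketch. It is not the authors' implementation: the relation hierarchy, the function names, and the rule that a relation's score is the super-category probability multiplied by the within-category probability are all assumptions made for illustration, and `toy_critic` is a hard-coded stand-in for a query to a language model.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical relation hierarchy; the grouping and labels are illustrative only.
HIERARCHY: Dict[str, List[str]] = {
    "geometric": ["on", "behind", "near"],
    "possessive": ["has", "wearing"],
    "semantic": ["eating", "riding", "jumping on"],
}

# (subject, predicate, object, score)
Triplet = Tuple[str, str, str, float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_scores(
    super_logits: Dict[str, float],
    fine_logits: Dict[str, List[float]],
) -> Dict[str, float]:
    """Assumed scoring rule: P(super-category) * P(relation | super-category)."""
    names = list(HIERARCHY)
    super_probs = dict(zip(names, softmax([super_logits[n] for n in names])))
    scores: Dict[str, float] = {}
    for name in names:
        conditional = softmax(fine_logits[name])
        for relation, prob in zip(HIERARCHY[name], conditional):
            scores[relation] = super_probs[name] * prob
    return scores


def commonsense_filter(
    triplets: List[Triplet],
    is_plausible: Callable[[str, str, str], bool],
) -> List[Triplet]:
    """Keep only the triplets that the critic judges plausible."""
    return [t for t in triplets if is_plausible(t[0], t[1], t[2])]


def toy_critic(subject: str, predicate: str, obj: str) -> bool:
    """Hypothetical stand-in for asking a language model about plausibility."""
    implausible = {("rabbit", "jumping on", "plate")}
    return (subject, predicate, obj) not in implausible


# Uniform logits, so every super-category and every relation within it ties.
uniform_scores = hierarchical_scores(
    {name: 0.0 for name in HIERARCHY},
    {name: [0.0] * len(rels) for name, rels in HIERARCHY.items()},
)

predictions: List[Triplet] = [
    ("rabbit", "jumping on", "plate", 0.9),
    ("person", "riding", "horse", 0.8),
]
kept = commonsense_filter(predictions, toy_critic)
```

Under the assumed product rule, the scores across all relations form a single probability distribution, so predictions from different super-categories can be ranked against each other. Passing the critic in as a callable keeps the filter independent of which model performs the validation.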