Abstract: Scene graph generation (SGG) detects objects and their relationships in visual data and has attracted increasing attention in recent years. Existing works have focused on extracting object context for SGG, but very few have attempted to exploit the implicit contextual correlations among the relationships between objects. Furthermore, most existing SGG schemes rely on high-level features to predict predicates while overlooking the potential inherent association between low-level features and object relationships. In this article, we present a novel scheme that captures enhanced contextual information for both objects and relationships. We design a Dual-branch Context Analysis Transformer (DCAT) architecture that extracts both object context and relationship context from visual data with dual transformer branches and then adaptively fuses high-level and low-level features to facilitate relationship prediction. Specifically, we first conduct feature representation learning to enrich relation representations using visual, spatial, and linguistic feature extractors. Next, two transformer branches model global associative interactions and mine the hidden associations among objects and relationships. Then, we devise a novel feature disentangling method to decouple contextualized high-level features with guidance from visual semantics. Finally, we develop a refined attention module that recalibrates low-level features to refine the final predicate prediction. Experiments on the Visual Genome and Action Genome datasets demonstrate the effectiveness of DCAT in both image and video SGG settings. Moreover, we evaluate the quality of the generated image scene graphs to verify their generalizability to downstream tasks such as sentence-to-graph retrieval and image retrieval.
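The abstract does not say how the adaptive fusion of high-level and low-level features is computed. As a rough illustration only, the sketch below assumes one common choice: a learned sigmoid gate that mixes the two feature vectors element-wise. The function name `adaptive_fuse`, the gate parameters `w_gate` and `b_gate`, and the toy dimensions are all hypothetical and are not taken from DCAT.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def adaptive_fuse(high, low, w_gate, b_gate):
    """Gated fusion of two equal-length feature vectors (assumed form, not DCAT's).

    gate_j  = sigmoid(sum_i concat(high, low)_i * w_gate[i][j] + b_gate[j])
    fused_j = gate_j * high_j + (1 - gate_j) * low_j
    """
    joint = high + low  # list concatenation, length 2 * d
    fused = []
    for j in range(len(high)):
        score = sum(joint[i] * w_gate[i][j] for i in range(len(joint))) + b_gate[j]
        gate = sigmoid(score)
        fused.append(gate * high[j] + (1.0 - gate) * low[j])
    return fused


# Toy example: one subject-object pair with 4-dimensional features.
random.seed(0)
dim = 4
high = [random.gauss(0.0, 1.0) for _ in range(dim)]  # stands in for contextualized features
low = [random.gauss(0.0, 1.0) for _ in range(dim)]   # stands in for low-level visual features
w_gate = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(2 * dim)]
b_gate = [0.0] * dim

fused = adaptive_fuse(high, low, w_gate, b_gate)
```

Because each gate value lies strictly between 0 and 1, every fused component falls between the corresponding high-level and low-level components. In a trained model the gate weights would be learned jointly with the predicate classifier.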
