Abstract: Scene graph generation (SGG), which detects objects and their relationships in visual data, has attracted increasing attention in recent years. Existing works have focused on extracting object context for SGG, but very few have attempted to exploit the implicit contextual correlations among the relationships between objects. Furthermore, most existing SGG schemes rely on high-level features to predict predicates while overlooking the potential inherent association between low-level features and object relationships. In this article, we present a novel scheme to capture enhanced contextual information for both objects and relationships. We design a Dual-branch Context Analysis Transformer (DCAT) architecture that extracts both object context and relationship context from visual data with dual transformer branches and then adaptively fuses high-level and low-level features to facilitate relationship prediction. Specifically, we first conduct feature representation learning to enrich relation representations with visual, spatial, and linguistic feature extractors. Next, two transformer branches model global associative interactions and mine the hidden associations among objects and relationships. Then, we devise a novel feature disentangling method to decouple contextualized high-level features with guidance from the visual semantics. Finally, we develop a refined attention module that recalibrates low-level features to refine the final predicate prediction. Experiments on the Visual Genome and Action Genome datasets demonstrate the effectiveness of DCAT in both image and video SGG settings. We also evaluate the quality of the generated image scene graphs to verify their generalizability to downstream tasks such as sentence-to-graph retrieval and image retrieval.

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Learning Visual Commonsense for Robust Scene Graph Generation

Scene Graph Generation for Better Image Captioning?

Generating Triples with Adversarial Networks for Scene Graph Construction

PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Part-Aware Interactive Learning for Scene Graph Generation

BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation

Adaptive Self-training Framework for Fine-grained Scene Graph Generation

Learning to Generate Scene Graph from Natural Language Supervision

Scene Graph Generation With Hierarchical Context

Boosting Scene Graph Generation with Contextual Information

Learning Canonical Representations for Scene Graph to Image Generation

Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

Embodied Semantic Scene Graph Generation

Hypercomplex context guided interaction modeling for scene graph generation