Abstract: Scene Graph Generation (SGG) aims to identify entities and predict relationship triplets \textit{\textless subject, predicate, object\textgreater} in visual scenes. Because subject-object pairs vary widely in appearance even under the same predicate, it is challenging to model and refine predicate representations directly across such pairs, yet this is the strategy adopted by most existing SGG methods. We observe that visual variations within an identical triplet are relatively small and that certain relation cues are shared within the same type of triplet, which can potentially facilitate relation learning in SGG. Moreover, for the long-tail problem widely studied in the SGG task, it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly, in this paper, we propose a Dual-granularity Relation Modeling (DRM) network that leverages fine-grained triplet cues in addition to coarse-grained predicate ones. DRM utilizes the contexts and semantics of predicates and triplets with Dual-granularity Constraints, generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore, a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones, enriching the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method, which establishes new state-of-the-art performance on the Visual Genome, Open Images, and GQA datasets. Our code is available at \url{https://github.com/jkli1998/DRM}.

GTR: A Grafting-Then-Reassembling Framework for Dynamic Scene Graph Generation

Dynamic Scene Graph Generation Via Temporal Prior Inference

Target Adaptive Context Aggregation for Video Scene Graph Generation

Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

End-to-End Video Scene Graph Generation with Temporal Propagation Transformer

SGTR+: End-to-end Scene Graph Generation with Transformer

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation

RelTR: Relation Transformer for Scene Graph Generation

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation

FTM: A Frame-level Timeline Modeling Method for Temporal Graph Representation Learning

GITSR: Graph Interaction Transformer-based Scene Representation for Multi Vehicle Collaborative Decision-making

Leveraging Predicate and Triplet Learning for Scene Graph Generation

BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation

Structured Sparse R-CNN for Direct Scene Graph Generation

TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning

Scene Graph Generation With External Knowledge and Image Reconstruction