Abstract: Scene Graph Generation (SGG) is a computer vision task that detects the objects in an image and the predicates that relate them. Existing SGG methods focus on modeling visual context to generate scene graphs and are developed on well-annotated datasets with high-quality images. Image quality is not guaranteed in social media posts, however: some images are incomplete or occluded by obstacles and therefore might not provide sufficient visual context for SGG. As a result, previous methods might miss visual relationships or detect false ones. To generate scene graphs effectively in social media, we study multimodal scene graph generation (MSG) in this paper. MSG aims to generate visual scene graphs from images in social media posts with the support of the accompanying text. However, leveraging textual content through simple multimodal alignment, such as object-level alignment, neglects the inherent pair-wise mapping between multimodal object pairs. To address this limitation, we propose a method named Deep pair-wise Relation Alignment for Knowledge-Enhanced (DRAKE) multimodal scene graph generation. The model supplements the missing visual context with well-aligned textual knowledge. It first converts the textual information into an object-aware knowledge representation with the help of the visual data. DRAKE then facilitates the interaction of information between the multimodal pair-wise representations. A multimodal context enhancement layer is devised to help the model generate the scene graph. To evaluate SGG performance on social media images, we propose a social media SGG dataset called MSG. We comprehensively analyze the effectiveness of our proposed method on the MSG dataset, and the experimental results indicate that our model outperforms previous methods. To compare our method fairly with other SGG models, we also conduct experiments on the Visual Genome dataset for further analysis. The MSG dataset is released at https://github.com/FuZe4ever/MSG.
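
The difference between object-level and pair-wise alignment can be illustrated with a toy sketch. This is not the DRAKE model: the embeddings, the concatenation-based pair feature, the cosine scoring, and the names `pair_features` and `align_pairs` are all assumptions made for illustration. Each ordered (subject, object) pair gets one feature per modality, and every visual pair is matched to its most similar textual pair.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(embeddings):
    """One feature per ordered object pair: subject and object embeddings concatenated.

    Concatenation is an assumed stand-in for a learned pair-wise representation.
    """
    return {
        (subj, obj): embeddings[subj] + embeddings[obj]
        for subj, obj in permutations(embeddings, 2)
    }


def align_pairs(visual, textual):
    """Match every visual object pair to its most similar textual object pair."""
    visual_pairs = pair_features(visual)
    textual_pairs = pair_features(textual)
    alignment = {}
    for v_pair, v_feat in visual_pairs.items():
        best = max(textual_pairs, key=lambda t: cosine(v_feat, textual_pairs[t]))
        alignment[v_pair] = (best, cosine(v_feat, textual_pairs[best]))
    return alignment


# Hypothetical 2-d embeddings for detected objects and for entities in the post text.
visual = {"person": [1.0, 0.0], "horse": [0.0, 1.0]}
textual = {"man": [0.9, 0.1], "horse": [0.1, 0.9]}

alignment = align_pairs(visual, textual)
for v_pair, (t_pair, score) in alignment.items():
    print(v_pair, "->", t_pair, round(score, 3))
```

Because the pair feature is ordered, (person, horse) and (horse, person) align to different textual pairs, which is the directional information that matching objects one at a time would discard.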

Knowledge-Enhanced Scene Graph Generation with Multimodal Relation Alignment (Student Abstract)

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

DRAKE: Deep Pair-Wise Relation Alignment for Knowledge-Enhanced Multimodal Scene Graph Generation in Social Media Posts

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

Self-Supervised Relation Alignment for Scene Graph Generation

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Fine‐Grained Scene Graph Generation with Overlap Region and Geometrical Center

Scene Graph Generation With External Knowledge and Image Reconstruction

Knowledge-Embedded Routing Network for Scene Graph Generation

Hypercomplex context guided interaction modeling for scene graph generation

Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning

Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Scene Graph Generation with Geometric Context

Knowledge-aware Dialogue Generation with Hybrid Attention (Student Abstract)

Scene Graph Generation With Hierarchical Context

HKA: A Hierarchical Knowledge Alignment Framework for Multimodal Knowledge Graph Completion