Abstract: Scene Graph Generation (SGG) aims to represent the objects in an image and the connections between them in a structured and comprehensive way, which can significantly benefit scene understanding and related downstream tasks. Existing SGG models mostly focus on the long-tailed problem caused by biased datasets. However, even when these models fit specific datasets better, they may still fail on unseen triples, that is, triples that do not appear in the training set. Most methods take a whole triple as input and learn its overall features through statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates seen in the training set are recombined into novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to handle unseen triples and improve the generalization capability of SGG models. We propose a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module that learn object and predicate categories separately at the feature level and align them with the corresponding visual features, so that the model is no longer limited to matching whole triples. In addition, since we observe that the long-tailed problem also affects generalization ability, we design a novel balanced learning strategy, comprising a Character Guided Sampling (CGS) module and an Informative Re-weighting (IR) module, which provides a tailor-made learning method for each predicate according to its characteristics. Extensive experiments show that our model achieves state-of-the-art performance. In particular, TISGG improves zR@20 (zero-shot recall) by 11.7% on the PredCls sub-task on the Visual Genome dataset.
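To make the zero-shot recall metric (zR@K) mentioned above concrete, the following is a minimal sketch of how it is commonly computed: recall is restricted to the ground-truth triples whose (subject, predicate, object) combination never occurs in the training set. This is an illustration under simplifying assumptions, not the paper's evaluation code. The function name `zero_shot_recall_at_k` and the toy triples are hypothetical, triples are matched by label only (a real evaluation would also match bounding boxes), and repeated triples within an image are collapsed into one.

```python
from typing import Iterable, List, Set, Tuple

# A triple is (subject class, predicate, object class); labels only (assumption).
Triple = Tuple[str, str, str]


def zero_shot_recall_at_k(
    train_triples: Iterable[Triple],
    gt_triples: Iterable[Triple],
    ranked_predictions: List[Triple],
    k: int,
) -> float:
    """Recall@K restricted to ground-truth triples never seen in training."""
    seen: Set[Triple] = set(train_triples)
    # Keep only the ground-truth triples whose combination is novel.
    unseen_gt = {t for t in gt_triples if t not in seen}
    if not unseen_gt:
        return float("nan")  # this image contains no zero-shot triples
    top_k = set(ranked_predictions[:k])
    return len(unseen_gt & top_k) / len(unseen_gt)


# Toy data: every object and predicate appears in training,
# but two of the test combinations do not.
train = [("man", "riding", "horse"), ("dog", "on", "bed")]
gt = [("man", "riding", "horse"), ("dog", "riding", "horse"), ("man", "on", "bed")]
preds = [
    ("man", "riding", "horse"),
    ("dog", "riding", "horse"),
    ("dog", "near", "horse"),
    ("man", "on", "bed"),
]

zr_at_2 = zero_shot_recall_at_k(train, gt, preds, k=2)
zr_at_4 = zero_shot_recall_at_k(train, gt, preds, k=4)
```

In the toy data, ("man", "riding", "horse") is excluded from the metric because it was seen in training, so only the two recombined triples count towards the denominator.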

Beyond Entities: A Large-Scale Multi-Modal Knowledge Graph with Triplet Fact Grounding

Triple-as-Node Knowledge Graph and Its Embeddings

Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models

TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio

MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph

Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction

Generalizable Entity Grounding via Assistance of Large Language Model

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective

Multi-source Knowledge Enhanced Graph Attention Networks for Multimodal Fact Verification

Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Enhancing missing facts inference in knowledge graph using triplet subgraph attention embeddings

FactKG: Fact Verification via Reasoning on Knowledge Graphs

AspectMMKG: A Multi-modal Knowledge Graph with Aspect-aware Entities

Multi-perspective knowledge graph completion with global and interaction features

Learning graph attention-aware knowledge graph embedding

Graphusion: Leveraging Large Language Models for Scientific Knowledge Graph Fusion and Construction in NLP Education

KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation