RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning

Kanghoon Yoon, Kibum Kim, Jaehyung Jeon, Yeonjun In, Donghyun Kim, Chanyoung Park
2024-12-17
Abstract: Scene Graph Generation (SGG) research has suffered from two fundamental challenges: the long-tailed predicate distribution and semantic ambiguity between predicates. These challenges lead to a bias towards head predicates in SGG models, favoring dominant general predicates while overlooking fine-grained predicates. In this paper, we address the challenges of SGG by framing it as a multi-label classification problem with partial annotation, where relevant labels of fine-grained predicates are missing. Under the new frame, we propose Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies potential instances to be multi-labeled and enriches the single-label with multi-labels that are semantically similar to the original label by retrieving relevant samples from our established memory bank. Based on augmented relations (i.e., discovered multi-labels), we apply multi-prototype learning to train our SGG model. Several comprehensive experiments have demonstrated that RA-SGG outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA, particularly in terms of F@K, showing that RA-SGG effectively alleviates the issue of biased prediction caused by the long-tailed distribution and semantic ambiguity of predicates.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two fundamental problems in Scene Graph Generation (SGG):

1. **Long-tailed distribution problem**: In the benchmark SGG datasets, the distribution of predicate classes is highly imbalanced. Some predicates (such as "on") are frequently labeled, while other fine-grained predicates (such as "walking in") are rare. This biases SGG models towards common predicates at prediction time and leads them to ignore fine-grained predicates that provide detailed relationship information.
2. **Semantic ambiguity problem**: The semantic boundaries between different predicates are not clear. For example, predicates such as "on", "walking on" and "walking in" are semantically similar, making it difficult for the model to distinguish them. Since fine-grained predicates occur rarely in the dataset, learning these subtle visual differences becomes very challenging.

To solve these problems, the authors propose a method named **Retrieval-Augmented Scene Graph Generation (RA-SGG)**. RA-SGG improves the performance of SGG models through the following steps:

- **Modeling multi-label classification problems**: Redefine the SGG task as a partially labeled multi-label classification problem, thereby discovering and enriching potential fine-grained predicates in the training data.
- **Selecting reliable multi-label instances**: Introduce the label inconsistency score to ensure that pseudo-label assignment is only performed on those relationship instances that are truly likely to have multiple labels.
- **Unbiased multi-label enhancement**: Use an inverse propensity score-based sampling strategy to sample less frequent fine-grained predicates from the retrieved relationship instances to increase data diversity.
- **Multi-prototype learning**: Minimize the distance between relationship instance embeddings and their multiple predicate prototypes to ensure that the model can capture the semantics of both the original predicates and the newly discovered fine-grained predicates simultaneously.

Experimental results show that RA-SGG outperforms existing methods on the VG and GQA datasets (by up to 3.6% and 5.9%, respectively, according to the abstract). In particular, on the F@K metric, it significantly improves the prediction ability for fine-grained categories without sacrificing the understanding of common categories.

In summary, this paper aims to overcome the challenges brought by long-tailed distribution and semantic ambiguity by improving the training method of SGG models, thereby generating more detailed and accurate scene graphs.
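The steps listed in this summary can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation, and this page does not give the paper's formulas, so several pieces are assumptions: the inconsistency score is approximated as the fraction of retrieved neighbors whose predicate differs from the original label, the propensity of a predicate is taken to be its training frequency, and the multi-prototype loss is a plain mean squared distance. All function names, the `threshold` parameter, and the toy frequencies are hypothetical.

```python
import random


def inconsistency_score(original_label, neighbor_labels):
    """Fraction of retrieved neighbors whose predicate differs from the original.

    Assumption: a simple stand-in for the paper's label inconsistency score.
    """
    differing = sum(1 for label in neighbor_labels if label != original_label)
    return differing / len(neighbor_labels)


def inverse_propensity_weights(candidates, class_freq):
    """Sampling weights proportional to 1 / training frequency, normalised to 1.

    Assumption: the propensity of a predicate is its frequency in the training set.
    """
    raw = [1.0 / class_freq[label] for label in candidates]
    total = sum(raw)
    return [w / total for w in raw]


def augment_labels(original_label, neighbor_labels, class_freq, threshold, rng):
    """Return the original label plus at most one sampled pseudo-label."""
    candidates = sorted(set(neighbor_labels) - {original_label})
    score = inconsistency_score(original_label, neighbor_labels)
    # Keep the single label when the neighbors mostly agree with it.
    if not candidates or score < threshold:
        return [original_label]
    weights = inverse_propensity_weights(candidates, class_freq)
    extra = rng.choices(candidates, weights=weights, k=1)[0]
    return [original_label, extra]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multi_prototype_loss(embedding, prototypes, labels):
    """Mean squared distance from an embedding to the prototype of each label.

    Assumption: an illustrative loss, not the objective used in the paper.
    """
    distances = [squared_distance(embedding, prototypes[label]) for label in labels]
    return sum(distances) / len(distances)


# Toy data (hypothetical frequencies and 2-D prototypes).
class_freq = {"on": 1000, "walking on": 50, "walking in": 10}
prototypes = {"on": [1.0, 0.0], "walking on": [0.0, 1.0], "walking in": [0.0, 2.0]}
neighbors = ["on", "walking on", "walking in", "walking on"]

rng = random.Random(0)
labels = augment_labels("on", neighbors, class_freq, threshold=0.5, rng=rng)
loss = multi_prototype_loss([0.5, 0.5], prototypes, labels)
print(labels, loss)
```

In this toy setup the rare predicate "walking in" receives a larger sampling weight than "walking on" because its frequency is lower, which is the intuition behind sampling by inverse propensity: rarer fine-grained predicates are more likely to be added as pseudo-labels.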