Abstract: Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.
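
The staged schedule and the distillation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `STAGES` table, the assumption that feature sets accumulate from stage to stage, the temperature value, and the use of a KL-divergence distillation loss are all assumptions introduced here, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical three-stage schedule: each stage targets one predicate group
# and (as an assumption) adds one more feature type to those already in use.
STAGES = [
    {"predicates": "easy", "features": ("bbox",)},
    {"predicates": "medium", "features": ("bbox", "mask")},
    {"predicates": "hard", "features": ("bbox", "mask", "boundary")},
]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened predicate distributions.

    The teacher stands in for the classifier from the previous stage; a KL
    loss is one common distillation choice and is an assumption here.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


# Toy logits for two subject-object pairs over three predicates.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.0, 1.0, -0.5], [0.0, 0.5, 1.0]])

for stage in STAGES:
    print(stage["predicates"], "->", ", ".join(stage["features"]))
print("distillation loss:", distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's predictions exactly and grows as they diverge, which is the property a later-stage classifier would need in order to retain what was learned in an earlier stage.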

What problem does this paper attempt to address?

This paper addresses the following problem: in the Panoptic Scene Graph Generation (PSG) task, almost all existing methods overlook shape-aware features, which are crucial for capturing the contours and boundaries of objects. The authors therefore propose a new model-agnostic curriculum learning strategy, Curricular shApe-aware FEature (CAFE) learning, to address this gap.

### Specific background of the problem

1. **Limitations of traditional Scene Graph Generation (SGG)**:
   - SGG relies on a bounding-box-based paradigm, which may lead to inaccurate object localization and limited background annotation.
   - The emerging PSG task addresses these problems by using a more fine-grained panoptic segmentation representation (i.e., scene masks) and by also defining relationships that involve background regions.
2. **Deficiencies of existing PSG methods**:
   - Most existing PSG methods inherit the strategies of SGG and still rely mainly on spatial features extracted from the minimum bounding box (bbox).
   - This approach ignores shape-aware features (such as mask features and boundary features), which can cause semantic confusion in fine-grained visual relationship prediction.

### Solutions proposed in the paper

To overcome the above problems, the authors propose the following:

1. **Introducing shape-aware features**:
   - Shape-aware features are of two types: mask features and boundary features.
   - Mask features exploit the details of the fine-grained mask representation, including the shape and contour of the object. Boundary features are extracted from the intersection of the subject and object masks and help capture the interaction between subject-object pairs.
2. **Curriculum learning strategy**:
   - Inspired by the human cognitive process, the authors propose a staged learning strategy that divides predicates into three difficulty groups and correspondingly divides the training process into three stages.
   - Each stage uses a specialized relation classifier to handle the predicates of a specific group, and the complexity of the features increases gradually, from simple bbox features to complex boundary features.
   - Knowledge distillation is applied between stages to retain the knowledge acquired in earlier stages.
3. **Model-agnostic design**:
   - CAFE is a model-agnostic strategy that can be integrated into any existing PSG model to improve its performance.

### Experimental verification

Through extensive experiments on challenging PSG datasets, the authors demonstrate the effectiveness and robustness of CAFE. Specifically:

- In the robust PSG task, CAFE achieves new state-of-the-art performance across different metrics.
- In the zero-shot PSG task, CAFE can infer unseen visual relationship triplets by using the robust visual relationship features learned during training.

### Summary

The main contributions of this paper are:

1. An in-depth exploration of a key problem in the PSG task: over-reliance on bounding-box-based spatial features while ignoring shape-aware features.
2. A new model-agnostic curriculum learning strategy (CAFE) that enables the model to learn shape-aware features in a simple-to-complex manner.
3. Extensive experimental results demonstrating the robustness and effectiveness of CAFE, which significantly outperforms existing state-of-the-art methods.
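
The difference between bbox cues and shape-aware cues described above can be illustrated with a small NumPy sketch. It is not the paper's feature extractor. In particular, the `dilate` helper and the choice to grow each mask by one pixel before intersecting them are assumptions made here so that adjacent, non-overlapping panoptic masks still yield a contact region; the summary only states that boundary features come from the intersection of the subject and object masks.

```python
import numpy as np


def min_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the tightest box around a mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def dilate(mask):
    """Grow a boolean mask by one pixel in the 4-neighbourhood (hypothetical helper)."""
    padded = np.pad(mask, 1)  # pads with False, keeps the boolean dtype
    return (
        padded[1:-1, 1:-1]
        | padded[:-2, 1:-1]
        | padded[2:, 1:-1]
        | padded[1:-1, :-2]
        | padded[1:-1, 2:]
    )


def contact_region(subj_mask, obj_mask):
    """Pixels where the dilated subject and object masks overlap (assumed definition)."""
    return dilate(subj_mask) & dilate(obj_mask)


# Two side-by-side segments on a 5x6 grid: subject on the left, object on the right.
subj = np.zeros((5, 6), dtype=bool)
obj = np.zeros((5, 6), dtype=bool)
subj[1:4, 0:3] = True
obj[1:4, 3:6] = True

print(min_bbox(subj))                   # (0, 1, 2, 3)
print(contact_region(subj, obj).sum())  # 6 pixels along the shared edge
```

The bbox keeps only four numbers per object, whereas the mask and the contact region preserve where the two segments actually touch, which is the kind of information the summary attributes to shape-aware features.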

From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
