SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation

Anh Duc Bui, Soyeon Caren Han, Josiah Poon
DOI: https://doi.org/10.48550/arXiv.2211.04773
2022-11-09
Abstract: Scene Graph Generation (SGG) serves as a comprehensive representation of images for human understanding as well as for visual understanding tasks. Due to the long-tail bias in the object and predicate labels of the available annotated data, scene graphs generated by current methods can be biased toward common, non-informative relationship labels. Relationships can also be non-mutually exclusive: the same relationship can be described from multiple perspectives, such as geometrical or semantic, which makes it even more challenging to predict the most suitable relationship label. In this work, we propose the SG-Shuffle pipeline for scene graph generation with 3 components: 1) Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) Shuffle Transformer, which learns to select the final relationship labels from the category-specific features generated in the previous step; and 3) Weighted CE loss, used to alleviate the training bias caused by the imbalanced dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve several key problems in the Scene Graph Generation (SGG) task:

1. **Long-tail bias problem**: In the existing annotated data, object and predicate labels follow a severe long-tail distribution: a few common labels account for most of the data, while a large number of uncommon labels have very few examples. This imbalance causes current methods to predict common, less informative relationship labels and to neglect the tail categories. The tail categories matter for downstream tasks because they provide more specific perspectives and more detailed descriptions.
2. **Multi-aspect relationship prediction**: Relationships are sometimes non-mutually exclusive and can be described from multiple perspectives, such as geometric or semantic ones, which makes it harder to predict the most appropriate label. The proposed method groups relationship labels with similar purposes and processes each group separately, which reduces classification interference between different semantic spaces and improves classification among labels with similar semantics.
3. **Model bias**: Traditional SGG methods usually build their predicate prediction modules on top of object detectors, which yields high performance on head categories and low performance on tail categories. The paper proposes a new architecture that combines parallel Transformer encoders, a Shuffle Transformer, and a Weighted Cross-Entropy Loss (Weighted CE Loss) to alleviate the bias that data imbalance introduces during training.

### Main contributions of the paper:

1. **Grouping related labels**: By grouping related labels and learning category-specific predicate features, the method reduces classification interference between different semantic spaces and improves classification among labels with similar semantics.
2. **Shuffle Transformer layer**: A Shuffle Transformer layer is proposed to fuse the fine-grained features of the different groups into general predicate features used for predicate classification.
3. **Weighted cross-entropy loss**: A simple loss-weighting strategy is applied during training to further address the long-tail bias within the same category.

### Method overview:

1. **Parallel Transformer encoders**: Four independent Transformer sub-models learn category-specific representations for four types of relationships: geometric, possessive, semantic, and miscellaneous.
2. **Shuffle Transformer**: The output features of the previous step are combined through the Shuffle Transformer layer, which lets information flow between the sub-models and propagates context information further.
3. **Weighted cross-entropy loss**: A weighted cross-entropy loss is applied in the final stage of training to balance learning across predicate labels and reduce the long-tail bias.

### Experimental results:

- **Quantitative evaluation**: On the VG150 dataset, SG-Shuffle shows significant performance improvements over existing SGG methods in all three settings: PredCls, SGCls, and SGDet.
- **Hyperparameter tuning**: Varying the number of Shuffle layers shows that a 5-layer Shuffle Transformer performs best in most settings.
- **Ablation study**: Removing either the Shuffle layer or the weighted CE loss confirms that each component contributes and that the combination of the two is important.

In conclusion, through its architecture design and loss weighting, the paper mitigates the long-tail bias problem in the SGG task and improves model performance on tail categories.
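
The weighted cross-entropy idea from the method overview can be illustrated with a small sketch. The summary does not state the paper's exact weighting scheme, so the inverse-frequency weights below are an assumption chosen for illustration, and the function names (`inverse_frequency_weights`, `weighted_cross_entropy`) and toy labels are hypothetical rather than taken from the authors' code.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Per-class weights inversely proportional to label frequency.

    Assumption: this is one common weighting scheme, not necessarily
    the one used in the paper.
    """
    counts = Counter(labels)
    total = len(labels)
    # Classes that never occur get weight 0.0 to avoid dividing by zero.
    return [
        total / (num_classes * counts[c]) if counts[c] > 0 else 0.0
        for c in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    Each sample's loss is scaled by the weight of its target class,
    and the sum is normalized by the total weight in the batch.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        w = weights[target]
        weighted_sum += -w * log_softmax(logits)[target]
        weight_total += w
    return weighted_sum / weight_total


# Toy data: class 0 is a frequent "head" predicate, class 2 a rare "tail" one.
train_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = inverse_frequency_weights(train_labels, num_classes=3)

batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]]
targets = [0, 2]
loss = weighted_cross_entropy(batch_logits, targets, weights)
```

With these toy labels the rare class receives the largest weight, so an error on a tail predicate contributes more to the loss than the same error on a head predicate. That is the general mechanism by which loss weighting counteracts an imbalanced label distribution.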