Scene Graph Generation Via Multi-Relation Classification and Cross-Modal Attention Coordinator

Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang
DOI: https://doi.org/10.1145/3444685.3446276
2021-01-01
Abstract: Scene graph generation aims to build a graph-based representation of an image, where nodes and edges represent objects and the relationships between them, respectively. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most existing work achieves satisfactory performance on simple and frequent relation classes (e.g., on), yet performs poorly on fine-grained and infrequent ones (e.g., walk on, stand on). To tackle this problem, we redesign the framework as two branches, a representation learning branch and a classifier learning branch, for a more balanced scene graph generator. For the representation learning branch, we propose the Cross-modal Attention Coordinator (CAC) to gather consistent features from multiple modalities using dynamic attention. For the classifier learning branch, we first transfer knowledge of relation classes from a large-scale corpus, then leverage a Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent relations and infrequent ones. Comprehensive experimental results on VG200, a challenging dataset, indicate the competitiveness and significant superiority of our proposed approach.