GTR: A Grafting-Then-Reassembling Framework for Dynamic Scene Graph Generation

Jiafeng Liang, Yuxin Wang, Zekun Wang, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin
DOI: https://doi.org/10.24963/ijcai.2023/131
2023-01-01
Abstract: Dynamic scene graph generation aims to identify visual relationships (subject-predicate-object) in frames based on spatio-temporal contextual information in the video. Previous work implicitly models spatial and temporal interactions simultaneously, which entangles the spatio-temporal contextual information. To address this, we propose a Grafting-Then-Reassembling framework (GTR), which explicitly extracts intra-frame spatial information and inter-frame temporal information in two separate stages to decouple spatio-temporal contextual information. Specifically, we first graft a static scene graph generation model to generate static visual relationships within frames. We then propose a temporal dependency model to extract the temporal dependencies across frames, and explicitly reassemble static visual relationships into dynamic scene graphs. Experimental results show that GTR achieves state-of-the-art performance on the Action Genome dataset. Further analyses reveal that the reassembling stage is crucial to the success of our framework.
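The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`static_stage`, `temporal_stage`, `reassemble`), the score dictionaries, the window size, and the threshold are all hypothetical, and a simple moving average stands in for the learned temporal dependency model. It only shows the data flow: per-frame relationships are produced first, temporal context is applied second, and the result is reassembled into one graph per frame.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]       # (subject, predicate, object)
FrameScores = Dict[Triplet, float]   # triplet -> confidence for one frame


def static_stage(frame_detections: List[FrameScores]) -> List[FrameScores]:
    """Stage 1 stand-in (hypothetical): a grafted static model would score
    relationships independently per frame; here the scores are given."""
    return [dict(frame) for frame in frame_detections]


def temporal_stage(per_frame: List[FrameScores], window: int = 1) -> List[FrameScores]:
    """Stage 2 stand-in (assumption): average each triplet's score over
    neighbouring frames instead of using a learned temporal model."""
    smoothed = []
    n = len(per_frame)
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        totals = defaultdict(float)
        for u in range(lo, hi):
            for triplet, score in per_frame[u].items():
                totals[triplet] += score
        smoothed.append({k: v / (hi - lo) for k, v in totals.items()})
    return smoothed


def reassemble(frames: List[FrameScores], threshold: float = 0.5) -> List[List[Triplet]]:
    """Keep, per (subject, object) pair, the best predicate above threshold."""
    graphs = []
    for frame in frames:
        best = {}
        for (s, p, o), score in frame.items():
            if score >= threshold and score > best.get((s, o), ("", 0.0))[1]:
                best[(s, o)] = (p, score)
        graphs.append(sorted((s, p, o) for (s, o), (p, _) in best.items()))
    return graphs


# Toy input: the middle frame has a weak "holding" score on its own.
frames = [
    {("person", "holding", "cup"): 0.9},
    {("person", "holding", "cup"): 0.3, ("person", "looking_at", "cup"): 0.4},
    {("person", "holding", "cup"): 0.9},
]
dynamic_graphs = reassemble(temporal_stage(static_stage(frames)))
```

In this toy input, the middle frame yields no relationship when thresholded on its own, while the temporally smoothed scores recover the "holding" relationship from its neighbours. That is the kind of effect a decoupled temporal stage is meant to provide, under the simplifying assumptions stated above.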