Dynamic Scene Graph Generation Via Temporal Prior Inference
Shuang Wang, Lianli Gao, Xinyu Lyu, Yuyu Guo, Pengpeng Zeng, Jingkuan Song
DOI: https://doi.org/10.1145/3503161.3548324
2022-01-01
Abstract: Real-world videos are composed of complex actions with inherent temporal continuity (e.g., "person-touching-bottle" is usually followed by "person-holding-bottle"). In this work, we propose a novel method, Temporal Prior Inference (TPI), to mine such temporal continuity for dynamic scene graph generation (DSGG). In contrast to current DSGG methods, which capture the temporal dependence of each video individually by refining representations, we make the first attempt to explore temporal continuity by extracting the full co-occurrence patterns of action categories from a variety of videos in the Action Genome (AG) dataset. These inherent patterns are then organized as Temporal Prior Knowledge (TPK), which serves as prior knowledge for model learning and inference. Furthermore, given this prior knowledge, human-object relationships in the current frame can be effectively inferred from adjacent frames via the robust Temporal Prior Inference algorithm at little computational cost. Specifically, to efficiently guide the generation of temporally consistent dynamic scene graphs, we incorporate temporal prior inference into a DSGG framework by introducing frame enhancement, a continuity loss, and fast inference. The proposed model-agnostic strategies significantly boost the performance of existing state-of-the-art models on the Action Genome dataset, achieving 69.7 and 72.6 for R@10 and R@20 on PredCLS. In addition, inference time can be reduced by 41% with an acceptable drop in R@10 (69.7 to 66.8) by using fast inference.
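The idea of mining label transitions across videos and reusing them as a prior can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_temporal_prior`, `infer_from_previous`), the simple row-normalized transition table, the linear blending rule, and the `weight` parameter are all assumptions made for illustration, and the toy label sequences are invented.

```python
from collections import defaultdict


def build_temporal_prior(videos):
    """Estimate P(current label | previous label) from label sequences.

    `videos` is a list of per-video label sequences, one label per frame.
    This is a hypothetical stand-in for Temporal Prior Knowledge.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for frames in videos:
        # Count each adjacent (previous, current) label pair.
        for prev, curr in zip(frames, frames[1:]):
            counts[prev][curr] += 1

    prior = {}
    for prev, row in counts.items():
        total = sum(row.values())
        # Row-normalize counts into transition probabilities.
        prior[prev] = {curr: n / total for curr, n in row.items()}
    return prior


def infer_from_previous(prior, prev_label, model_scores, weight=0.5):
    """Blend per-frame model scores with the prior from the previous frame.

    `weight` (an assumed hyperparameter) controls how much the prior counts;
    weight=0 falls back to the model's own prediction.
    """
    transition = prior.get(prev_label, {})
    blended = {
        label: (1 - weight) * score + weight * transition.get(label, 0.0)
        for label, score in model_scores.items()
    }
    return max(blended, key=blended.get)


# Toy relationship sequences for a single person-bottle pair (invented data).
videos = [
    ["touching", "holding", "holding"],
    ["touching", "holding", "drinking_from"],
    ["touching", "touching", "holding"],
]
prior = build_temporal_prior(videos)

# The model alone slightly prefers "touching" for the current frame,
# but the previous frame was "touching", which is usually followed by "holding".
scores = {"touching": 0.5, "holding": 0.4}
print(infer_from_previous(prior, "touching", scores))
```

In this toy setup the prior shifts the prediction toward the relationship that typically follows the previous frame's label, which is the kind of temporal consistency the abstract describes; the actual TPI algorithm, frame enhancement, and continuity loss are defined in the paper and may differ substantially.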