Human Object Interaction Detection via Multi-level Conditioned Network

Xu Sun, Xinwen Hu, Tongwei Ren, Gangshan Wu
DOI: https://doi.org/10.1145/3372278.3390671
2020-01-01
Abstract: As one of the essential problems in scene understanding, human object interaction detection (HOID) aims to recognize fine-grained object-specific human actions, which demands the capabilities of both visual perception and reasoning. Existing methods based on convolutional neural networks (CNNs) utilize diverse visual features for HOID, which are insufficient for complex human object interaction understanding. To enhance the reasoning capability of CNNs, we propose a novel multi-level conditioned network that fuses extra spatial-semantic knowledge with visual features. Specifically, we construct a multi-branch CNN as the backbone for multi-level visual representation. We then encode extra knowledge, including human body structure and object context, as conditions that dynamically influence the feature extraction of the CNN through affine transformation and an attention mechanism. Finally, we fuse the modulated multimodal features to distinguish the interactions. The proposed method is evaluated on the two most frequently used benchmarks, HICO-DET and V-COCO. The experimental results show that our method outperforms state-of-the-art methods.
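The abstract does not give the exact formulation of the conditioning step, so the following is only a minimal sketch of the general idea, assuming a generic feature-wise affine modulation (per-channel scale and shift) followed by softmax attention pooling. All function names, the toy feature map, and the gamma/beta values are hypothetical; in a real network the scale, shift, and attention scores would be predicted from the encoded condition (for example, body structure or object context) rather than written by hand.

```python
import math


def affine_condition(features, gamma, beta):
    """Scale and shift each channel of a C x N feature map.

    gamma and beta hold one value per channel. Here they are fixed
    numbers; the assumption is that a conditioning branch would
    normally predict them from the extra knowledge.
    """
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Pool each channel over positions using softmax attention weights."""
    weights = softmax(scores)
    return [
        sum(w * value for w, value in zip(weights, channel))
        for channel in features
    ]


# Toy input: 2 channels, 3 spatial positions (illustrative values only).
features = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
gamma = [2.0, 0.0]  # hypothetical per-channel scales
beta = [0.0, 1.0]   # hypothetical per-channel shifts

modulated = affine_condition(features, gamma, beta)
# Equal scores give uniform attention, so pooling reduces to a mean.
pooled = attention_pool(modulated, [0.0, 0.0, 0.0])
```

The affine step lets the condition amplify, suppress, or offset individual channels, while the attention step decides which positions contribute to the pooled descriptor; the paper's actual branches and fusion scheme may differ from this sketch.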