Human-Object Interaction detection via Global Context and Pairwise-level Fusion Features Integration

Haozhong Wang,Hua Yu,Qiang Zhang
DOI: https://doi.org/10.1016/j.neunet.2023.11.002
Abstract:Recent two-stage detector-based methods show superiority in Human-Object Interaction (HOI) detection along with the successful application of transformer. However, these methods are limited to extracting the global contextual features through instance-level attention without considering the perspective of human-object interaction pairs, and the fusion enhancement of interaction pair features lacks further exploration. The human-object interaction pairs guiding global context extraction relative to instance guiding global context extraction more fully utilize the semantics between human-object pairs, which helps HOI recognition. To this end, we propose a two-stage Global Context and Pairwise-level Fusion Features Integration Network (GFIN) for HOI detection. Specifically, the first stage employs an object detector for instance feature extraction. The second stage aims to capture the semantic-rich visual information through the proposed three modules, Global Contextual Feature Extraction Encoder (GCE), Pairwise Interaction Query Decoder (PID), and Human-Object Pairwise-level Attention Fusion Module (HOF). The GCE module intends to extract the global context memory by the proposed crossover-residual mechanism and then integrate it with the local instance memory from the DETR object detector. HOF utilizes the proposed pairwise-level attention mechanism to fuse and enhance the first stage's multi-layer feature. PID outputs multi-label interaction recognition results with the input of the query sequence from HOF and the memory from GCE. Finally, comprehensive experiments conducted on HICO-DET and V-COCO datasets demonstrate that the proposed GFIN significantly outperforms the state-of-the-art methods. Code is available at https://github.com/ddwhzh/GFIN.
What problem does this paper attempt to address?