MMFENet: Multi-Modal Feature Enhancement Network with Transformer for Human-Object Interaction Detection

Peng Wang, Wenhao Yang, Zhicheng Zhao, Fei Su
DOI: https://doi.org/10.1109/ijcnn60899.2024.10650930
2024-01-01
Abstract: The Transformer model has been successfully applied to human-object interaction (HOI) detection in a one-stage mode. However, this mode does not fully leverage rich and valuable cues such as scene and linguistic information, which weakens the Transformer’s representation power. In this paper, the Transformer is introduced into a novel two-stage HOI detection framework, named MMFENet, in which an interaction subnet and a feature enhancement subnet are constructed to jointly detect HOIs. Specifically, the interaction subnet is first constructed to effectively fuse visual (or semantic) features with spatial information, thus enhancing the relationships among all human-object pairs. In addition, the feature enhancement subnet is built to improve the contextual comprehension of interaction features by incorporating a diverse range of scene information encoded in both visual and linguistic modalities. Extensive experimental results show that MMFENet achieves very competitive results on two public HOI detection benchmarks (HICO-DET and V-COCO).