Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Kejun Xue, Yongbin Gao, Zhijun Fang, Xiaoyan Jiang, Wenjun Yu, Mingxuan Chen, Chenmou Wu
DOI: https://doi.org/10.1007/s10489-024-05774-7
IF: 5.3
2024-10-01
Applied Intelligence
Abstract: Human-object interaction (HOI) detection is an important computer vision task that recognizes interactions between humans and surrounding objects in an image or video. HOI datasets suffer from a severe long-tailed data distribution because it is difficult to build a dataset that covers all potential interactions. Many HOI detectors have addressed this issue by utilizing visual-language models. However, owing to the computation mechanism of the Transformer, visual-language models are not good at extracting local features from input samples. Therefore, we propose a novel local feature enhanced Transformer that encourages the encoders to extract more informative multi-modal features. Moreover, the application of prompt learning to HOI detection is still at a preliminary stage. Consequently, we propose a multi-modal adaptive prompt module, which uses an adaptive learning strategy to facilitate the interaction between language and visual prompts. On the HICO-DET and SWIG-HOI datasets, the proposed model achieves 24.21% mAP and 14.29% mAP, respectively, on the full set of interactions. Our code is available at https://github.com/small-code-cat/AMP-HOI.
computer science, artificial intelligence