Spatially-Aware Human-Object Interaction Detection with Cross-Modal Enhancement

Gaowen Liu, Huan Liu, Caixia Yan, Yuyang Guo, Rui Li, Sizhe Dang
DOI: https://doi.org/10.1007/978-981-99-8073-4_7
2024-01-01
Abstract: We propose a novel two-stage human-object interaction (HOI) detection model that incorporates cross-modal spatial information awareness. Relative spatial relationships between humans and objects are highly relevant to specific HOI categories, but current approaches fail to model these crucial cues explicitly. We observe that relative spatial relationships can be described easily and intuitively in natural language. Building on this observation and inspired by recent advances in prompt tuning, we design a Prompt-Enhanced Spatial Modeling (PESM) module that generates linguistic descriptions of the spatial relations between humans and objects. PESM merges the explicit spatial information obtained from these text descriptions with the implicit spatial information of the visual modality. Moreover, we devise a two-stage model architecture that effectively incorporates auxiliary cues to exploit the enhanced cross-modal spatial information. Extensive experiments on the HICO-DET benchmark demonstrate that the proposed model outperforms state-of-the-art methods, indicating its effectiveness. The source code is available at https://github.com/liugaowen043/tsce.
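To make the idea of describing human-object spatial relations in natural language concrete, here is a minimal sketch that turns a pair of bounding boxes into a short text description. It is an illustration only and not the authors' PESM module: the function names (`describe_spatial_relation`, `center`, `iou`), the sentence template, and the rule of picking the dominant axis between box centers are all assumptions made for this example. Boxes are assumed to be `(x1, y1, x2, y2)` in image coordinates with y increasing downward.

```python
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2), image coordinates, y grows downward.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def describe_spatial_relation(human: Box, obj: Box, obj_name: str) -> str:
    """Build a template sentence for the object's position relative to the person.

    Assumption: the direction is taken from the dominant axis of the offset
    between box centers; the overlap phrase is taken from whether IoU > 0.
    """
    hx, hy = center(human)
    ox, oy = center(obj)
    dx, dy = ox - hx, oy - hy

    if dx == 0 and dy == 0:
        direction = "centered on"
    elif abs(dx) >= abs(dy):
        direction = "to the right of" if dx > 0 else "to the left of"
    else:
        direction = "below" if dy > 0 else "above"

    overlap = "overlapping with" if iou(human, obj) > 0 else "separate from"
    return f"the {obj_name} is {direction} the person and {overlap} the person"


if __name__ == "__main__":
    person = (0.0, 0.0, 10.0, 20.0)
    print(describe_spatial_relation(person, (20.0, 5.0, 30.0, 15.0), "ball"))
    print(describe_spatial_relation(person, (2.0, 15.0, 8.0, 30.0), "skateboard"))
```

In a prompt-based pipeline, a sentence like this could be passed to a text encoder and fused with visual features; how the paper actually builds and fuses its descriptions is not specified in the abstract, so that step is left out here.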