Detecting Zero-Shot Human-Object Interaction with Visual-Text Modeling

Haozhong Wang, Hua Yu, Qiang Zhang
DOI: https://doi.org/10.1109/ICVR57957.2023.10169554
2023-01-01
Abstract: Most existing Human-Object Interaction (HOI) detection methods rely on supervised learning, but labeling all interactions is costly because of the enormous number of possible object-verb combinations. Zero-shot HOI detection is a promising approach to this problem, but it still struggles with unseen interactions. To this end, we propose a novel two-stage Visual-Text modeling HOI detection (VT-HOI) method that can effectively recognize both seen and unseen interactions. In the first stage, human and object features are extracted by DETR and concatenated to form the query sequences. In the second stage, local and global memory features from the Visual Encoder are fused into the corresponding query sequences by our proposed Semantic Representation Decoder through cross-attention. We then compute the cosine similarity between visual features, extracted by the Visual Representation Head (VRH), and text features, generated from labels by the Text Feature Memory (TFM) module. Finally, the similarity matrix is fused with the output of the classification head for training or inference. Comprehensive experiments on the HICO-DET dataset demonstrate that the proposed VT-HOI significantly outperforms state-of-the-art methods.
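
The similarity-and-fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the fusion rule, so the weighted average below, the weight `alpha`, and all function names are assumptions chosen for illustration, and the toy feature vectors stand in for the VRH and TFM outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity_matrix(visual_feats, text_feats):
    """Return a (num_queries x num_labels) matrix of cosine similarities."""
    v = [l2_normalize(row) for row in visual_feats]
    t = [l2_normalize(row) for row in text_feats]
    return [[sum(a * b for a, b in zip(vr, tr)) for tr in t] for vr in v]


def fuse_scores(similarity, class_scores, alpha=0.5):
    """Hypothetical fusion: weighted average of similarity and classifier scores.

    The paper's actual fusion rule is not given in the abstract; `alpha` is an
    assumed parameter.
    """
    return [
        [alpha * s + (1.0 - alpha) * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(similarity, class_scores)
    ]


# Toy stand-ins: 2 human-object query features, 3 interaction-label text features.
visual = [[1.0, 0.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class_scores = [[0.9, 0.1, 0.5], [0.2, 0.3, 0.8]]

sim = cosine_similarity_matrix(visual, text)
fused = fuse_scores(sim, class_scores, alpha=0.5)
```

Because the text features are derived from labels rather than from training images, a label never seen during training can still receive a similarity score, which is the property a zero-shot setting relies on.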