Human-object interaction detection based on cascade multi-scale transformer

Limin Xia,Xiaoyue Ding
DOI: https://doi.org/10.1007/s10489-024-05324-1
IF: 5.3
2024-02-17
Applied Intelligence
Abstract:Human-object interaction (HOI) detection is an advanced computer vision task for detecting the relationship between human and surrounding objects. Some methods have emerged to accomplish this task with impressive results, but possess certain limitations. We analyze in detail the advantages and disadvantages between different paradigms, and creatively propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder. These three components are responsible for extracting contextual features, localizing the human-object pairs and classifying the specific interactions in pairs, respectively. CMST decouples the tasks of object detection and interaction classification while still maintains an end-to-end detection pipeline. Furthermore, we aim to address the issues of high computational complexity and slow convergence associated with the transformer architecture. To achieve this, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. By incorporating these attentions, we introduce multi-scale features, making our model well-suited for complex scenes involving instances of varying scales. The effectiveness of our approach is proven on widely-used benchmarks where we achieve better improvements. The experimental results demonstrate that CMST has great potential for real-time applications and complex scene detection.
computer science, artificial intelligence
What problem does this paper attempt to address?