Abstract: Cameras are essential vision instruments for capturing images used in pattern detection and measurement. Human-object interaction (HOI) detection is one of the most popular pattern detection approaches for human-centric visual scenes. Recently, Transformer-based models have become the dominant approach to HOI detection because their network architectures deliver promising results. However, most of them follow the one-stage design of the vanilla Transformer, which leaves rich geometric priors under-exploited and compromises performance, especially under occlusion. Given that geometric features tend to outperform visual ones in occluded scenarios and offer information that complements visual cues, we propose a novel end-to-end Transformer-style HOI detection model, the geometric features enhanced HOI detector (GeoHOI). One key part of the model is UniPointNet, a new unified self-supervised keypoint learning method that provides a consistent keypoint representation across diverse object categories, including humans. GeoHOI upgrades a Transformer-based HOI detector in two ways: keypoint similarities measure the likelihood of human-object interactions, and local keypoint patches enhance the interaction query representation. Together these improve HOI predictions. Extensive experiments show that the proposed method outperforms state-of-the-art models on V-COCO and achieves competitive performance on HICO-DET. A case study on post-disaster rescue with vision-based instruments demonstrates the applicability of GeoHOI in real-world settings.

Human–object interaction detection based on disentangled axial attention transformer

Human-Object Interaction Detection via Disentangled Transformer

Parallel disentangling network for human–object interaction detection

Human-object interaction detection based on cascade multi-scale transformer

Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection

HODN: Disentangling Human-Object Feature for HOI Detection

Multi-Scale Human-Object Interaction Detector

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Category-Aware Transformer Network for Better Human-Object Interaction Detection

A Novel Part Refinement Tandem Transformer for Human-Object Interaction Detection

Pairwise CNN-Transformer Features for Human–Object Interaction Detection

Geometric Features Enhanced Human-Object Interaction Detection

Neural-Logic Human-Object Interaction Detection

Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection

Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Focus and Adjust: Progressive Refinement Network for Human Object Interaction Detection

Learning Transferable Human-Object Interaction Detector with Natural Language Supervision

Few-shot human-object interaction video recognition with transformers