Abstract: Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks that aim to localise and recognise relationships between objects and interactions between humans and objects, respectively. Prevailing works treat them as distinct tasks, which has led to task-specific models tailored to individual datasets. However, we posit that visual relationships can furnish crucial contextual and relational cues that significantly improve the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, in which scene graphs serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to unify the tasks of SGG and HOI detection. Concretely, a relation Transformer first generates relation triples from a suite of visual features. A second Transformer-based decoder then predicts human-object interactions from the generated relation triples. Comprehensive experiments on established benchmark datasets, including Visual Genome, V-COCO, and HICO-DET, demonstrate the compelling performance of SG2HOI+ in comparison to prevalent one-stage SGG models. Our approach also achieves competitive performance compared with state-of-the-art HOI methods. Additionally, we observe that SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements on both tasks compared with training on each task individually.

Mask-Guided Transformer for Human-Object Interaction Detection

Multi-Scale Human-Object Interaction Detector

End-to-End Human Object Interaction Detection with HOI Transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Human-Object Interaction Detection via Disentangled Transformer

Human–object interaction detection based on disentangled axial attention transformer

Geometric Features Enhanced Human-Object Interaction Detection

Human-object interaction detection based on cascade multi-scale transformer

Pairwise CNN-Transformer Features for Human–Object Interaction Detection

Parallel disentangling network for human–object interaction detection

GTNet: Guided Transformer Network for Detecting Human-Object Interactions

Category-Aware Transformer Network for Better Human-Object Interaction Detection

A Novel Part Refinement Tandem Transformer for Human-Object Interaction Detection

ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection

MMFENet: Multi-Modal Feature Enhancement Network with Transformer for Human-Object Interaction Detection

Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection

Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Toward a Unified Transformer-Based Framework for Scene Graph Generation and Human-Object Interaction Detection

HODN: Disentangling Human-Object Feature for HOI Detection