Abstract: Human-Object Interaction (HOI) detection, an important problem in computer vision, requires locating each human-object pair and identifying the interactions between them. An HOI instance spans a larger range in space, scale, and task than an individual object instance, which makes its detection more susceptible to noisy backgrounds. To reduce the influence of noisy backgrounds on HOI detection, the input image information should be used to generate fine-grained anchors, which then guide the detection of HOI instances. This raises two challenges: i) how to extract pivotal features from images with complex background information remains an open question; ii) how to semantically align the extracted features with the query embeddings is also difficult. In this paper, we propose a novel end-to-end transformer-based framework, FGAHOI, to address these problems. FGAHOI comprises three dedicated components: multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM), and a task-aware merging mechanism (TAM). MSS extracts features of humans, objects, and interaction areas from noisy backgrounds for HOI instances of various scales. HSAM and TAM then semantically align and merge the extracted features and query embeddings, first from the hierarchical spatial perspective and then from the task perspective. In addition, a novel Stage-wise Training Strategy is designed to ease the training difficulty caused by the complexity of the tasks FGAHOI performs. We also propose two ways to measure the difficulty of HOI detection and a novel dataset, HOI-SDC, targeting two challenges of HOI instance detection: Uneven Distributed Area in Human-Object Pairs and Long Distance Visual Modeling of Human-Object Pairs. Experiments are conducted on three benchmarks: HICO-DET, HOI-SDC, and V-COCO. Our model outperforms state-of-the-art HOI detection methods, and extensive ablations demonstrate the merits of our proposed contributions.

Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization.

Modeling 4D Human-Object Interactions for Event and Object Recognition

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics.

In-Hand 3D Object Reconstruction from a Monocular RGB Video

HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction

UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

Hi4D: 4D Instance Segmentation of Close Human Interaction

Learning Human-Object Interaction via Interactive Semantic Reasoning

Human-object Interaction Detection with Depth-Augmented Clues

Kinematics-based 3D Human-Object Interaction Reconstruction from Single View

HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR

ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment

HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

TMHOI: Translational Model for Human-Object Interaction Detection

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection.

Cascaded Human-Object Interaction Recognition

Exploring Pose-Aware Human-Object Interaction Via Hybrid Learning

From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos