Abstract: This paper addresses Human-Object Interaction (HOI) detection, the task of identifying and understanding the interactions between humans and objects in a given image or video frame. Spearheaded by the Detection Transformer (DETR), recent developments have led to significant improvements by replacing traditional region proposals with a set of learnable queries. However, despite the powerful representation capabilities of Transformers, existing HOI detection methods still yield low confidence scores on complex interactions and are prone to overlooking interactive actions. To address these issues, we propose \textsc{UAHOI}, Uncertainty-aware Robust Human-Object Interaction Learning, a novel approach that explicitly estimates prediction uncertainty during training to refine both detection and interaction predictions. Our model not only predicts HOI triplets but also quantifies the uncertainty of these predictions. Specifically, we model this uncertainty through the variance of predictions and incorporate it into the optimization objective, allowing the model to adaptively adjust its confidence threshold based on prediction variance. Serving as an automatic confidence threshold, this integration mitigates the adverse effects of incorrect or ambiguous predictions that are common in traditional methods, without any hand-designed components. Our method can be readily integrated with existing HOI detection methods and demonstrates improved accuracy. We evaluate \textsc{UAHOI} on two standard benchmarks, V-COCO and HICO-DET, which represent challenging scenarios for HOI detection. Extensive experiments demonstrate that \textsc{UAHOI} achieves significant improvements over existing state-of-the-art methods, enhancing both the accuracy and robustness of HOI detection.
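To make the idea of folding prediction variance into the objective concrete, the following is a minimal sketch of one common way to do so: a per-prediction log-variance attenuates the classification loss and is itself penalized by a regularization term. The abstract does not give the exact \textsc{UAHOI} formulation, so this particular form (heteroscedastic loss attenuation), the function names `bce` and `uncertainty_weighted_loss`, and the `log_vars` parameter are all assumptions made for illustration.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for a single interaction score in (0, 1)."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def uncertainty_weighted_loss(probs, targets, log_vars):
    """Mean of exp(-s) * BCE + s over predictions, with s = log-variance.

    Assumed form, not taken from the paper: a large s down-weights the
    loss of an ambiguous prediction, while the additive s term stops the
    model from declaring every prediction uncertain.
    """
    total = 0.0
    for p, t, s in zip(probs, targets, log_vars):
        total += math.exp(-s) * bce(p, t) + s
    return total / len(probs)


# Hypothetical usage: one hard (misclassified) and one easy prediction.
hard = uncertainty_weighted_loss([0.1], [1.0], [1.0])
easy = uncertainty_weighted_loss([0.9], [1.0], [1.0])
```

Under this assumed form, raising the log-variance lowers the loss on a badly misclassified interaction but raises it on a confidently correct one, which is the trade-off that lets the variance act as a learned, per-prediction confidence adjustment rather than a hand-set threshold.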

Human-Object Interaction Prediction with Natural Language Supervision

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Learning Transferable Human-Object Interaction Detector with Natural Language Supervision

UAHOI: Uncertainty-aware Robust Interaction Learning for HOI Detection

Exploring Pose-Aware Human-Object Interaction Via Hybrid Learning

Human-Object Interaction Detection via Disentangled Transformer

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

TMHOI: Translational Model for Human-Object Interaction Detection

HODN: Disentangling Human-Object Feature for HOI Detection

Neural-Logic Human-Object Interaction Detection

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Improving Human-Object Interaction Detection via Virtual Image Learning

Zero-Shot Human-Object Interaction Detection via Similarity Propagation

Reformulating HOI Detection as Adaptive Set Prediction

Toward Open-Set Human Object Interaction Detection

Human-Object Interaction Prediction in Videos through Gaze Following

Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection