Abstract: Detecting Human-Object Interactions (HOI) is challenging because it requires discerning intricate, high-level relationships between humans and objects. Recent studies have explored HOI Vision-and-Language Modeling (HOIVLM), which draws on cross-modal techniques to exploit linguistic information. Despite its promise, current methods are constrained by limited annotation vocabularies and suboptimal word embeddings, which hinder alignment with visual features and, consequently, the efficient transfer of linguistic knowledge. In this work, we propose a novel cross-modal framework that leverages external propositional knowledge to harmonize annotation text with a broader spectrum of world knowledge, enabling a more explicit and unambiguous representation of complex semantic relationships. A further complication is that a single human-object (HO) pair often involves multiple interactions, whether symbiotic or distinctive, while the same interaction can occur with diverse HO pairs (e.g., “human ride bicycle” vs. “human ride horse”). The challenge lies in capturing the subtle differences and similarities between interactions that involve different objects or occur in varied contexts. To this end, we propose the Jaccard contrast strategy, which simultaneously optimizes cross-modal representation consistency across HO pairs (especially where multiple interactions occur) and encompasses both vision-to-vision and vision-to-knowledge alignment objectives. Extensive experiments validate the effectiveness of the proposed method and demonstrate its superiority in HOI analysis.
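
The abstract does not spell out the Jaccard contrast objective, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes that the Jaccard index between the interaction-label sets of two HO pairs serves as a soft target in a contrastive loss over cosine similarities. The names `jaccard`, `cosine`, `jaccard_contrast_loss`, the `temperature` value, and the toy features and labels are all hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two label sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def jaccard_contrast_loss(anchors, candidates, anchor_labels, candidate_labels,
                          temperature=0.1):
    """Assumed soft-target contrastive loss.

    For each anchor, the target distribution over candidates is the
    normalized Jaccard overlap of their label sets; the prediction is a
    softmax over cosine similarities divided by `temperature`. The loss is
    the mean cross-entropy between the two.
    """
    total, counted = 0.0, 0
    for feat, labels in zip(anchors, anchor_labels):
        weights = [jaccard(labels, c) for c in candidate_labels]
        weight_sum = sum(weights)
        if weight_sum == 0:
            continue  # no candidate shares a label with this anchor
        logits = [cosine(feat, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total -= sum((w / weight_sum) * (l - log_z)
                     for w, l in zip(weights, logits))
        counted += 1
    return total / counted if counted else 0.0


# Toy data (hypothetical): three HO pairs, the first with two interactions.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pair_labels = [{"ride", "sit_on"}, {"ride"}, {"feed"}]

# Vision-to-vision: HO pairs are contrasted against each other.
v2v = jaccard_contrast_loss(visual, visual, pair_labels, pair_labels)

# Vision-to-knowledge: candidates are stand-in knowledge embeddings,
# one per interaction class, each labeled with a singleton set.
knowledge = [[1.0, 0.1], [0.8, 0.3], [0.0, 1.0]]
class_labels = [{"ride"}, {"sit_on"}, {"feed"}]
v2k = jaccard_contrast_loss(visual, knowledge, pair_labels, class_labels)

print(v2v, v2k)
```

Under this reading, a pair annotated with several interactions spreads its target mass over every candidate that shares at least one label with it, rather than being forced toward a single positive. That is one way to handle the multi-interaction case the abstract highlights.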

Learning from Easy to Hard Pairs: Multi-step Reasoning Network for Human-Object Interaction Detection

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Learning Human-Object Interaction via Interactive Semantic Reasoning

Learning Human-Object Interaction Detection Using Interaction Points

Hierarchical Reasoning Network with Contrastive Learning for Few-Shot Human-Object Interaction Recognition

Interaction is Worth More Explanations: Improving Human-Object Interaction Representation with Propositional Knowledge

Hierarchical Reasoning Network for Human-Object Interaction Detection

Detecting Human-Object Interaction with Multi-Level Pairwise Feature Network

Parallel Reasoning Network for Human-Object Interaction Detection

Human Object Interaction Detection via Multi-level Conditioned Network

Multi-branch Graph Network for Learning Human-Object Interaction

Action-Guided Attention Mining and Relation Reasoning Network for Human-Object Interaction Detection

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

Exploring Pose-Aware Human-Object Interaction Via Hybrid Learning

Learning to Detect Human-Object Interactions

Spatial-Aware Multi-Level Parsing Network for Human-Object Interaction

Reformulating HOI Detection as Adaptive Set Prediction

RR-Net: Relation Reasoning for End-to-End Human-Object Interaction Detection

Mining the Benefits of Two-stage and One-stage HOI Detection

Spatial Parsing and Dynamic Temporal Pooling Networks for Human-Object Interaction Detection

Cascaded Human-Object Interaction Recognition