Interaction is Worth More Explanations: Improving Human-Object Interaction Representation with Propositional Knowledge
Feng Yang,Yichao Cao,Xuanpeng Li,Weigong Zhang
DOI: https://doi.org/10.1109/tcds.2024.3496566
IF: 4.546
2024-01-01
IEEE Transactions on Cognitive and Developmental Systems
Abstract:Detecting Human-Object Interactions (HOI) presents a formidable challenge, necessitating the discernment of intricate, high-level relationships between humans and objects. Recent studies have explored HOI Vision-and-Language Modeling (HOIVLM), which leverages linguistic information inspired by cross-modal technology. Despite its promise, current methodologies face challenges due to the constraints of limited annotation vocabularies and suboptimal word embeddings, which hinder effective alignment with visual features and consequently, the efficient transfer of linguistic knowledge. In this work, we propose a novel cross-modal framework that leverages external propositional knowledge which harmonize annotation text with a broader spectrum of world knowledge, enabling a more explicit and unambiguous representation of complex semantic relationships. Additionally, considering the prevalence of multiple complexities due to the symbiotic or distinctive relationships inherent in one HO pair, along with the identical interactions occurring with diverse HO pairs (e.g., “human ride bicycle” vs. “human ride horse”). The challenge lies in understanding the subtle differences and similarities between interactions involving different objects or occurring in varied contexts. To this end, we propose the Jaccard contrast strategy to simultaneously optimize cross-modal representation consistency across HO pairs (especially for cases where multiple interactions occur), which encompasses both vision-to-vision and vision-to-knowledge alignment objectives. The effectiveness of our proposed method is comprehensively validated through extensive experiments, showcasing its superiority in the field of HOI analysis.