Abstract: Human-Object Interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection. The complexity of human behavior and the diverse contexts in which these interactions occur make the task even more challenging. Contextual cues, such as the participants involved, body language, and the surrounding environment, play crucial roles in predicting these interactions, especially those that are unseen or ambiguous. Moreover, large VLMs are trained on vast image and text data, enabling them to generate contextual cues that help in understanding real-world contexts, object relationships, and typical interactions. Building on this, in this paper we introduce ConCue, a novel approach for improving visual feature extraction in HOI detection. Specifically, we first design specialized prompts to utilize large VLMs to generate contextual cues within an image. To fully leverage these cues, we develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors. Extensive experiments and analyses demonstrate the effectiveness of using these contextual cues for HOI detection. The experimental results show that integrating ConCue with existing state-of-the-art methods significantly enhances their performance on two widely used datasets.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the issue of insufficient contextual information in Human-Object Interaction (HOI) detection. Specifically, traditional HOI detection methods often fail to fully capture the necessary contextual information when identifying interactions between humans and objects, leading to inaccurate interaction classification. Although large Vision-Language Models (VLMs) perform well in tasks involving human interactions, they are not specifically optimized for HOI detection. Additionally, the complexity of human behavior and the diverse contexts in which interactions occur further exacerbate this challenge.

### Main Issues

1. **Insufficient Contextual Information**: Traditional methods mainly rely on visual information in images while neglecting contextual information such as the identity of participants, body language, and surrounding environment, which are crucial for accurately predicting interactions.
2. **Complex Interaction Understanding**: The diversity and complexity of human behavior make it difficult to accurately identify certain interactions based solely on visual features, especially those that are uncommon or ambiguous.
3. **Limitations of Existing Models**: Although large VLMs excel in aligning visual and textual data, their ability to recognize complex interactions is limited because the training data for these models may not cover all types of interactions.

### Solution

To address the above issues, the authors propose a new method called ConCue, which enhances HOI detection performance by utilizing contextual cues generated by large VLMs. The specific steps are as follows:

1. **Generating Contextual Cues**: A set of specialized prompts is designed to extract contextual cues from large VLMs, including participant cues, body language cues, environmental cues, and temporal cues.
2. **Feature Extraction Module**: A Transformer-based multi-tower architecture feature extraction module is developed, which integrates contextual cues into instance and interaction detectors to enhance visual feature extraction.
3. **Experimental Validation**: Extensive experiments on two widely used HOI detection benchmark datasets validate the effectiveness and practicality of ConCue.

### Summary

The main contributions of the paper are:

- Identifying the limitations of traditional methods and large VLMs in HOI detection and proposing the ConCue method to overcome these limitations.
- Emphasizing the critical role of contextual cues in recognizing complex interactions and designing a set of specialized prompts to generate these cues.
- Introducing a context cue-based visual feature extraction method that effectively guides and enhances the feature extraction process.
- Validating the effectiveness, interoperability, and practical application value of ConCue on two widely recognized HOI detection benchmark datasets.
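To make the cue-generation step in the summary more concrete, here is a minimal Python sketch of how one prompt per cue type might be sent to a VLM. It is an illustration under stated assumptions, not the authors' implementation: the summary names the four cue types (participant, body language, environmental, temporal) but does not give the prompt wording, so the prompt strings, `CUE_PROMPTS`, `CueRequest`, `build_cue_requests`, and `collect_cues` are all hypothetical. The VLM itself is represented by any caller-supplied callable, so no particular model API is assumed.

```python
from dataclasses import dataclass

# Hypothetical prompt wording. The source lists the four cue types but does
# not state the actual prompts used by the authors.
CUE_PROMPTS = {
    "participant": "Describe the people and objects taking part in the scene.",
    "body_language": "Describe the posture, gestures, and gaze of each person.",
    "environment": "Describe the setting and surroundings of the scene.",
    "temporal": "Describe what likely happened just before and what may happen next.",
}


@dataclass
class CueRequest:
    """One prompt to send to the VLM for a given cue type."""
    cue_type: str
    prompt: str


def build_cue_requests(cue_types=None):
    """Return one CueRequest per selected cue type, preserving the given order."""
    selected = list(CUE_PROMPTS) if cue_types is None else list(cue_types)
    unknown = [c for c in selected if c not in CUE_PROMPTS]
    if unknown:
        raise ValueError(f"unknown cue types: {unknown}")
    return [CueRequest(c, CUE_PROMPTS[c]) for c in selected]


def collect_cues(image, vlm, cue_types=None):
    """Query the VLM once per cue type and return {cue_type: generated text}.

    `vlm` is an assumption: any callable taking (image, prompt) and
    returning a string. It stands in for whichever model is actually used.
    """
    requests = build_cue_requests(cue_types)
    return {r.cue_type: vlm(image, r.prompt) for r in requests}


if __name__ == "__main__":
    # Stub VLM for demonstration only; a real model call would go here.
    def stub_vlm(image, prompt):
        return f"[generated text for: {prompt}]"

    for cue_type, text in collect_cues(image=None, vlm=stub_vlm).items():
        print(cue_type, "->", text)
```

In a full pipeline, the returned texts would then be encoded and passed to the feature extraction module; that stage is left out here because the summary does not describe its internals in enough detail to sketch faithfully.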

Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Generating Human-Centric Visual Cues for Human-Object Interaction Detection via Large Vision-Language Models

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models

Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

Visual Compositional Learning for Human-Object Interaction Detection

Amplifying Key Cues for Human-Object-Interaction Detection

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration

Toward Open-Set Human Object Interaction Detection

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Human Object Interaction Detection via Multi-level Conditioned Network

In vivo vascular freezing in clinical microvascular transfer

Contextual Object Detection with Multimodal Large Language Models

HODN: Disentangling Human-Object Feature for HOI Detection