Abstract: Human-object interaction (HOI) detection aims at capturing human-object pairs in images and their corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias of the real world, existing methods mostly struggle with rare human-object pairs and produce sub-optimal results. With the recent development of generative models, a straightforward approach is to construct a more balanced dataset from a group of supplementary generated samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset cannot significantly boost performance. To alleviate this problem, we present a novel model-agnostic framework called the Context-Enhanced Feature Alignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the loss of important context information caused by the traditional discriminator-style alignment method, we employ a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. Extensive experiments show that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories. Code: https://github.com/LijunZhang01/CEFA

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **how to improve human-object interaction (HOI) detection performance on rare categories by bridging the domain gap between the generated data and the original data**.

### Problem Background

Human-Object Interaction (HOI) detection aims to capture human-object pairs in an image and their corresponding actions. However, due to natural biases in the real world, existing methods perform poorly on rare human-object pairs, resulting in sub-optimal results. Although the development of generative models has made it possible to construct more balanced datasets, there is a significant domain gap between the generated data and the original data, and simply merging the generated data into the original dataset cannot significantly improve model performance.

### Method Proposed in the Paper

To solve the above problem, the paper proposes a new model-agnostic framework named **Context-Enhanced Feature Alignment (CEFA)**, which aligns the generated data with the original data at the feature level, thereby bridging the domain gap. Specifically, CEFA contains two modules:

1. **Instance Feature Alignment Module**:
   - This module aligns human-object pairs by aggregating instance information.
   - It uses a graph-based Prototype Instance Alignment Module (PIAM), which treats high-scoring tokens as prototypes and constructs a graph network over them to align human-object pairs, so as to better aggregate instance information.
2. **Context Enhancement Module**:
   - This module uses a context-enhanced image reconstruction branch to improve the model's ability to capture contextual cues.
   - It randomly masks a part of the generated image and uses the features of the original image as auxiliary information to help restore the masked part, thereby enhancing the model's understanding of the context.

### Experimental Verification

The paper conducted experiments on two benchmark datasets, HICO-Det and V-COCO. The results show that the CEFA module can effectively improve HOI detection performance on rare categories. For example, on the HICO-Det dataset, the CEFA module increased rare-category accuracy by 1.11% for the CDN model, 1.49% for the GEN-VLKT model, and 1.59% for the HOICLIP model.

### Summary

The main contributions of the paper include:

- Proposing a new model-agnostic method, CEFA, which can effectively bridge the domain gap between the generated data and the original data.
- Constructing an instance feature alignment module and a context enhancement module, which are used to align instance information and to enhance context understanding, respectively.
- As a plug-and-play module, CEFA can be conveniently applied to existing HOI models and significantly improves their detection performance on rare categories.

Through these improvements, the paper provides an effective way to deal with the long-tail problem in HOI detection, with the largest gains on rare categories.
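The two data-preparation steps described for the method (treating high-scoring tokens as prototypes, and randomly masking part of a generated image) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `select_prototypes`, `random_patch_mask`, and `apply_mask` are hypothetical, the choice of top-k selection and square-patch masking is an assumption (the summary only says "high-scoring tokens" and "randomly masks a part"), and the graph network and reconstruction branch themselves are not shown.

```python
import random


def select_prototypes(scores, k):
    """Return the indices of the k highest-scoring tokens, in ascending order.

    Assumption: "high-scoring tokens" is modelled as a plain top-k selection;
    ties are broken in favour of the lower index.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def random_patch_mask(height, width, patch, ratio, seed=None):
    """Build a height x width grid of 0/1 values, where 1 marks a masked pixel.

    Assumption: masking is done on non-overlapping square patches, and the
    number of masked patches is ratio * total patches, rounded down.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    rows, cols = height // patch, width // patch
    n_patches = rows * cols
    n_masked = int(n_patches * ratio)
    rng = random.Random(seed)  # local generator, so the global state is untouched
    masked = rng.sample(range(n_patches), n_masked)
    mask = [[0] * width for _ in range(height)]
    for p in masked:
        r0, c0 = (p // cols) * patch, (p % cols) * patch
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                mask[r][c] = 1
    return mask


def apply_mask(image, mask, fill=0.0):
    """Replace every masked pixel of a 2-D image (list of rows) with `fill`."""
    return [
        [fill if m else px for px, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    token_scores = [0.1, 0.9, 0.4, 0.8]  # made-up scores, for illustration only
    print(select_prototypes(token_scores, k=2))

    generated = [[1.0] * 8 for _ in range(8)]  # stand-in for a generated image
    mask = random_patch_mask(8, 8, patch=2, ratio=0.25, seed=0)
    masked_image = apply_mask(generated, mask)
    print(sum(map(sum, mask)), "of 64 pixels masked")
```

In a full pipeline, the masked generated image would presumably be passed to a reconstruction branch together with features from the original image, and the selected prototype tokens would become the nodes of the alignment graph; both of those components are outside the scope of this sketch.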

A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap
