Abstract: Human-object interaction (HOI) detection aims at capturing human-object pairs in images and their corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias of real-world data, existing methods mostly struggle with rare human-object pairs and produce sub-optimal results. With the recent development of generative models, a straightforward approach is to construct a more balanced dataset from a group of supplementary generated samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset does not significantly boost performance. To alleviate this problem, we present a novel model-agnostic framework called the Context-Enhanced Feature Alignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On one hand, considering the crucial role of human-object pair information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the loss of important context information caused by the traditional discriminator-style alignment method, we employ a context-enhanced image reconstruction module to improve the model's ability to learn contextual cues. Extensive experiments show that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories (https://github.com/LijunZhang01/CEFA).
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to improve human-object interaction (HOI) detection performance on rare categories by bridging the domain gap between generated data and the original data**.
### Problem Background
Human-Object Interaction (HOI) detection aims to capture human-object pairs in an image and their corresponding actions. However, due to natural biases in the real world, existing methods perform poorly on rare human-object pairs, resulting in sub-optimal results. Although the development of generative models has made it possible to construct more balanced datasets, there is a significant domain gap between the generated data and the original data, and simply merging the generated data into the original dataset does not significantly improve model performance.
### Method Proposed in the Paper
To solve the above problems, the paper proposes a new model-agnostic framework named **Context-Enhanced Feature Alignment (CEFA)**, which aligns the generated data with the original data at the feature level, thereby bridging the domain gap. Specifically, CEFA contains two modules:
1. **Instance Feature Alignment Module**:
- This module aligns human-object pairs by aggregating instance information.
- It uses a graph-based Prototype Instance Alignment Module (PIAM), which treats high-scoring tokens as prototypes and constructs a graph network over them to align human-object pairs, so as to better aggregate instance information.
2. **Context Enhancement Module**:
- This module uses a context-enhanced image reconstruction branch to improve the model's ability to capture contextual cues.
- It randomly masks a part of the generated image and uses the features of the original image as auxiliary information to help restore the masked part, thereby enhancing the model's understanding of the context.
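
The two selection steps described above (treating high-scoring tokens as prototypes, and randomly masking part of the generated image) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `select_prototypes` and `random_patch_mask`, the top-k selection rule, the patch-level masking, and the `mask_ratio` parameter are all assumptions made for illustration, and the graph construction and reconstruction steps are omitted because the summary does not specify them.

```python
import random


def select_prototypes(scores, k):
    """Return the indices of the k highest-scoring tokens.

    Hypothetical stand-in for prototype selection: the summary only says
    that high-scoring tokens are regarded as prototypes, so plain top-k
    is an assumption.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans, True where a patch is masked.

    Hypothetical stand-in for the random masking of the generated image;
    the masking granularity and ratio are assumptions.
    """
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Example: four tokens with confidence scores; keep the two strongest.
prototype_ids = select_prototypes([0.1, 0.9, 0.4, 0.7], k=2)

# Example: mask a quarter of a 4x4 patch grid with a seeded generator.
mask = random_patch_mask(num_patches=16, mask_ratio=0.25, rng=random.Random(0))
```

In a full pipeline, the selected prototypes would presumably feed the alignment graph, and the masked patches would be the targets that the reconstruction branch restores with help from original-image features.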
### Experimental Verification
The paper reports experiments on two benchmark datasets, HICO-DET and V-COCO. The results show that the CEFA module can effectively improve HOI detection performance on rare categories. For example, on the HICO-DET dataset, the CEFA module increased the rare-category accuracy of the CDN model by 1.11%, the GEN-VLKT model by 1.49%, and the HOICLIP model by 1.59%.
### Summary
The main contributions of the paper include:
- Proposing a new model-agnostic method, CEFA, which can effectively bridge the domain gap between the generated data and the original data.
- Constructing an instance feature alignment module and a context enhancement module, which are respectively used to align instance information and enhance context understanding.
- As a plug-and-play module, CEFA can be conveniently applied to existing HOI models and improves their detection performance on rare categories.
Through these contributions, the paper provides an effective way to address the long-tail problem in HOI detection, with the largest gains on rare categories.