ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Mingda Jia, Liming Zhao, Ge Li, Yun Zheng
2024-12-12
Abstract: Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a weakness in human-object interaction (HOI) detection: when foreground instances (such as people or objects) are blurry or occluded, existing methods have difficulty identifying interactions accurately. Specifically:

1. **Limitations of existing HOI detectors**:
   - Existing HOI detectors mainly rely on Transformer-based object detection pipelines. These methods localize objects well but explore spatial context (such as the background and the surrounding environment) insufficiently.
   - When foreground instances are blurry or occluded, existing methods perform poorly because they rely too heavily on instance-centric attributes and overlook spatial context.
2. **Differences between humans and machines**:
   - Humans can accurately identify an HOI even when instances are unclear or completely invisible. For example, even if a driver is hidden behind the car window, we can still infer that the driver is driving the car.
   - This ability indicates that humans use rich spatial context to compensate for missing visual cues.
3. **The core of the problem**:
   - Existing HOI detectors model spatial context insufficiently and cannot use background information to help identify interactions the way humans do.

To solve these problems, the authors propose ContextHOI, a dual-branch framework that strengthens HOI detection through spatial context learning. Specific improvements include:

- **Dual-branch framework**: One branch captures instance features, and the other learns spatial context features.
- **Spatial contrast constraint**: Multi-level spatial contrast constraints let the model distinguish instance regions from background regions and keep background noise from interfering with instance detection.
- **Semantic-guided context exploration**: Knowledge from pre-trained vision-language models (VLMs) helps the model better understand context information.
- **Context aggregator**: Instance features and context features are fused to improve the accuracy of HOI prediction.

With these improvements, ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks and performs particularly well in scenarios with blurry or occluded instances.
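
The fusion step performed by the context aggregator can be illustrated with a small sketch. The summary above does not say how ContextHOI fuses the two branches, so this example assumes a single scaled dot-product attention step from one instance feature over a set of context features, followed by a residual addition. The function names (`softmax`, `dot`, `aggregate_context`) and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_context(instance_feat, context_feats):
    """Fuse one instance feature with a set of context features.

    Assumption (not from the paper): fusion is one scaled dot-product
    attention step with the instance feature as the query, followed by
    a residual addition.
    """
    dim = len(instance_feat)
    # Similarity between the instance query and each context feature.
    scores = [dot(instance_feat, c) / math.sqrt(dim) for c in context_feats]
    weights = softmax(scores)
    # Attention-weighted sum of the context features.
    attended = [
        sum(w * c[i] for w, c in zip(weights, context_feats))
        for i in range(dim)
    ]
    # Residual connection keeps the original instance evidence.
    fused = [x + a for x, a in zip(instance_feat, attended)]
    return fused, weights


if __name__ == "__main__":
    # Toy 2-D features: one instance query and two context tokens.
    fused, weights = aggregate_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print("attention weights:", weights)
    print("fused feature:", fused)
```

In this sketch the context token most similar to the instance query receives the larger weight, and the residual addition means a blurred or occluded instance still contributes whatever evidence it has while the context fills in the rest.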