Abstract: In this paper, we introduce a novel human interaction detection approach based on CALIPSO (Classifying ALl Interacting Pairs in a Single shOt), a classifier of human-object interactions. This single-shot interaction classifier estimates interactions simultaneously for all human-object pairs, regardless of their number and class. State-of-the-art approaches adopt a multi-shot strategy based on a pairwise estimate of interactions for a set of human-object candidate pairs, so their complexity depends, at best, on the number of interactions and, at worst, on the number of candidate pairs. In contrast, the proposed method estimates all interactions between all human subjects and object targets in a single forward pass over the whole image. Consequently, its complexity and computation time are constant, independent of the number of subjects, objects or interactions in the image. In detail, interaction classification is performed on a dense grid of anchors by a joint multi-task network that learns three complementary tasks simultaneously: (i) prediction of the interaction types, (ii) estimation of the presence of a target and (iii) learning of an embedding that maps an interacting subject and target to the same representation, using a metric learning strategy. In addition, we introduce an object-centric passive-voice verb estimation which significantly improves results. Evaluations on two well-known human-object interaction image datasets, V-COCO and HICO-DET, show that the proposed method is competitive with the state of the art (2nd place) while keeping a constant computation time regardless of the number of objects and interactions in the image.
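The abstract does not say how the learned embedding is used at inference time. One plausible reading, given here only as a minimal sketch and not as the authors' method, is nearest-neighbour matching: each subject is paired with the anchor whose embedding is closest, among anchors where a target is predicted to be present. The function name `pair_subjects_with_targets`, the `presence_threshold` parameter, the Euclidean distance and the toy values are all assumptions made for illustration.

```python
import math


def pair_subjects_with_targets(subject_emb, anchor_emb, target_presence,
                               presence_threshold=0.5):
    """Pair each subject with its nearest candidate anchor (hypothetical helper).

    subject_emb:     list of D-dimensional embeddings, one per human subject
    anchor_emb:      list of D-dimensional embeddings, one per grid anchor
    target_presence: list of scores in [0, 1], one per grid anchor
    Returns one anchor index per subject, or -1 if no anchor passes the threshold.
    """
    # Keep only anchors where a target is predicted to be present (assumed rule).
    candidates = [i for i, score in enumerate(target_presence)
                  if score >= presence_threshold]

    pairs = []
    for emb in subject_emb:
        if not candidates:
            pairs.append(-1)
            continue
        # Euclidean distance is an assumption; the abstract only says
        # "metric learning" without naming the metric.
        nearest = min(candidates, key=lambda i: math.dist(emb, anchor_emb[i]))
        pairs.append(nearest)
    return pairs


# Toy example: two subjects, three anchors, 2-D embeddings.
subjects = [[0.0, 0.0], [1.0, 1.0]]
anchors = [[0.1, 0.0], [0.9, 1.0], [0.0, 0.05]]
presence = [0.9, 0.8, 0.2]  # anchor 2 falls below the threshold
print(pair_subjects_with_targets(subjects, anchors, presence))  # [0, 1]
```

In this toy example, anchor 2 is the closest to the first subject in embedding space but is discarded because its presence score is low, which illustrates how the target-presence task and the embedding task could complement each other.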

Weakly supervised learning of interactions between humans and objects

Explicit modeling of human-object interactions in realistic videos

Detecting and Recognizing Human-Object Interactions

Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance

Weakly-Supervised Object Detection Learning through Human-Robot Interaction

A self-organizing neural network architecture for learning human-object interactions

Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses

A weakly supervised CNN model for spatial localization of human activities in unconstraint environment

A novel hierarchical interaction model and HITS map for action recognition in static images

Learning Human-Human Interactions in Images from Weak Textual Supervision

Spatio-Temporal Action Localization in a Weakly Supervised Setting

Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition

Learning Human-Object Interaction Detection Using Interaction Points

Exploring Pose-Aware Human-Object Interaction Via Hybrid Learning

Classifying All Interacting Pairs in a Single Shot

Learning Object Spatial Relationship from Demonstration

Exploiting Human Pose and Scene Information for Interaction Detection

Weakly supervised learning of actions from transcripts

Weakly Supervised Segmentation Guided Hand Pose Estimation During Interaction with Unknown Objects

Human Object Interaction Detection using Two-Direction Spatial Enhancement and Exclusive Object Prior

Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild