Human-Object Interaction Prediction with Natural Language Supervision

Zhengxue Li, Gaoyun An
DOI: https://doi.org/10.1109/ICSP56322.2022.9965210
2022-01-01
Abstract: Although datasets for the HOI task already contain a rich set of Human-Object Interaction types, it is impractical to label and learn all (object-interaction) combinations, since the same object can have different categories of interactions with humans. When uncommon interaction combinations occur in real application scenarios, existing models struggle to make correct predictions. To address these issues, we propose a novel Transformer-based HOI prediction model. The model converts the triad labels (human-interaction-object) of HOI tasks into natural language descriptions of images and uses the converted sentences as new image labels, predicting interactions in a joint feature space of natural language and HOI interactions. This approach transforms the image-to-triplet mapping problem into an image-to-natural-language mapping problem, so it can deal with uncommon HOI interaction combinations. In addition, we use a new image Precise Relative Position Embedding method to enhance the perception of distance between image instances and the detection of instance relevance in the joint space. Because the model can identify new interaction combinations, it can also be applied to zero-shot learning experiments. Extensive experiments on the SWIG-HOI and HICO-DET datasets show that our model improves noticeably over previous methods.
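The label conversion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual sentence template, so the function name `triplet_to_sentence`, the "a photo of ..." wording, and the underscore-separated label format are all assumptions made for illustration, not the authors' implementation.

```python
def triplet_to_sentence(interaction: str, obj: str, subject: str = "person") -> str:
    """Render a (human, interaction, object) label as a caption-like sentence.

    The template below is a hypothetical choice; the paper's real template
    is not stated in the abstract.
    """
    # Assumed label format: words joined by underscores, e.g. "talking_on".
    verb = interaction.replace("_", " ")
    noun = obj.replace("_", " ")
    # Pick the indefinite article from the first letter (a rough heuristic).
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"a photo of a {subject} {verb} {article} {noun}"


# Hypothetical (interaction, object) labels, including an unseen pairing:
labels = [("riding", "bicycle"), ("holding", "umbrella"), ("talking_on", "cell_phone")]
sentences = [triplet_to_sentence(verb, obj) for verb, obj in labels]
for sentence in sentences:
    print(sentence)
```

Because any (interaction, object) pair yields a valid sentence, a combination never seen during training still has a well-defined text label, which is the property the abstract relies on for uncommon and zero-shot combinations.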