Abstract: Recognition of human activities is an essential field in computer vision. Most human activities consist of interactions between humans and objects. Many works on human-object interaction (HOI) recognition have achieved acceptable results in recent years. Still, they are fully supervised and need labeled training data for every HOI class. Because the space of human-object interactions is enormous, listing and providing training data for all possible categories is costly and impractical. To solve this problem, we propose an approach for scaling human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object from the video and combines them into an HOI class. Recognizing verbs and objects instead of HOIs allows new combinations of verbs and objects to be identified, so the system can identify an HOI class that it has not seen during training. We introduce a neural network architecture that can understand and represent the video data. The proposed system learns verbs and objects from the available training data in the training phase and identifies verb-object pairs in a video at test time. It can therefore identify HOI classes formed from different combinations of objects and verbs. We also propose using lateral information to combine the verbs and objects into valid verb-object pairs, which helps prevent the detection of rare and probably wrong HOIs. This lateral information comes from word embedding techniques. Furthermore, we propose a new feature aggregation method that aggregates the high-level features extracted from video frames before feeding them to the classifier. We show that this feature aggregation method is more effective for actions that include multiple subactions. We evaluated our system on the recently introduced, challenging Charades dataset, which contains many HOI categories in videos.
We show that the proposed system can detect unseen HOI classes while maintaining acceptable recognition of seen classes. Therefore, the number of classes the system can identify is greater than the number of classes used for training.
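
The abstract does not specify how verb and object predictions are combined with the word-embedding information, so the following is only a minimal sketch of the general idea under stated assumptions: each verb-object pair is scored by multiplying the softmax probabilities of the verb and the object, weighted by a compatibility term derived from the cosine similarity of their word vectors. The function names, the `alpha` parameter, the toy logits, and the two-dimensional toy "embeddings" are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_hoi_pairs(verb_logits, object_logits, verb_vecs, object_vecs, alpha=1.0):
    """Score every verb-object pair and return them sorted best-first.

    Assumed scoring rule (not the paper's): p(verb) * p(object) * compat**alpha,
    where compat maps cosine similarity from [-1, 1] to [0, 1].
    Setting alpha=0 disables the word-embedding prior.
    """
    verbs = list(verb_logits)
    objects = list(object_logits)
    p_verb = dict(zip(verbs, softmax([verb_logits[v] for v in verbs])))
    p_obj = dict(zip(objects, softmax([object_logits[o] for o in objects])))
    scored = []
    for v in verbs:
        for o in objects:
            compat = 0.5 * (1.0 + cosine(verb_vecs[v], object_vecs[o]))
            scored.append(((v, o), p_verb[v] * p_obj[o] * compat ** alpha))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy inputs: hypothetical classifier logits and hand-made 2-D word vectors.
verb_logits = {"drink": 2.0, "open": 0.5}
object_logits = {"cup": 0.4, "door": 0.6}
verb_vecs = {"drink": [1.0, 0.0], "open": [0.0, 1.0]}
object_vecs = {"cup": [0.9, 0.1], "door": [0.1, 0.9]}

with_prior = rank_hoi_pairs(verb_logits, object_logits, verb_vecs, object_vecs)
without_prior = rank_hoi_pairs(
    verb_logits, object_logits, verb_vecs, object_vecs, alpha=0.0
)
print("with prior:   ", with_prior[0][0])
print("without prior:", without_prior[0][0])
```

In this toy setup the object classifier slightly prefers "door", so without the prior the top pair is the implausible ("drink", "door"); with the compatibility term the ranking shifts to ("drink", "cup"). Because verbs and objects are scored separately, any pair can be ranked, including combinations that never appeared together in training.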

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning
