Abstract: The rapid growth of online video platforms has created an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos reflect more complicated semantics, such as character relationships or emotions, and can therefore better support downstream applications such as story summarization and fine-grained clip retrieval. However, social interactions last longer, overlap heavily with one another, and involve multiple characters, dynamic scenes, and multi-modal cues, among other factors, so traditional solutions for short-term action recognition are likely to fail on this task. To address these challenges, in this article we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues and then learns the node representations as short-term interaction patterns via an adapted GCN module. Next, global interaction representations are accumulated through a sub-clip identification module, which filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
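
To make the frame-level step more concrete, the following is a minimal sketch of one standard GCN propagation step over a small per-frame graph. It uses the widely known symmetric-normalized formulation, not the adapted GCN module of InteractNet, whose details the abstract does not give. The function name `gcn_layer`, the three-node graph (two characters and one scene node), the feature values, and the identity weight matrix are all hypothetical and chosen only for illustration.

```python
import math

def gcn_layer(adj, feats, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W). Illustrative only."""
    n = len(adj)
    # Add self-loops so each node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric normalization by node degree.
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features (norm @ feats).
    d_in = len(feats[0])
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(d_in)]
           for i in range(n)]
    # Linear transform followed by ReLU (agg @ weight).
    d_out = len(weight[0])
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(d_in)))
             for o in range(d_out)]
            for i in range(n)]

# Hypothetical frame graph: nodes 0 and 1 are characters, node 2 is the scene.
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d node features
weight = [[1.0, 0.0], [0.0, 1.0]]             # identity weights, for readability
out = gcn_layer(adj, feats, weight)
```

In this toy graph every node is connected to every other node, so after one step all three nodes hold the same averaged representation. A real system would use learned weights and graphs built from the detected multi-modal cues.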

Shifted GCN-GAT and Cumulative-Transformer Based Social Relation Recognition for Long Videos

Overall-Distinctive GCN for Social Relation Recognition on Videos

Learning Social Spatio-Temporal Relation Graph in the Wild and a Video Benchmark

Social Relation Recognition from Videos Via Multi-Scale Spatial-Temporal Reasoning

Recognizing Social Relationships in Long Videos Via Multimodal Character Interaction

Social Relation Graph Generation on Untrimmed Video

Socializing the Videos: A Multimodal Approach for Social Relation Recognition

Video Relation Detection with Spatio-Temporal Graph

Linking the Characters

Multi-stream Fusion Model for Social Relation Recognition from Videos

InteractNet: Social Interaction Recognition for Semantic-rich Videos

Social Relation Analysis from Videos Via Multi-entity Reasoning

Attentive Sequences Recurrent Network for Social Relation Recognition from Video

Video Captioning Via Relation-Aware Graph Learning

Toward jointly understanding social relationships and characters from videos

Group Activity Recognition by Using Effective Multiple Modality Relation Representation with Temporal-Spatial Attention

Multi-Modal Multi-Action Video Recognition

Recognizing Characters and Relationships from Videos Via Spatial-Temporal and Multimodal Cues

Relation Extraction from Videos Based on IoT Intelligent Collaboration Framework

When I Fall in Love: Capturing Video-oriented Social Relationship Evolution Via Attentive GNN

Target Adaptive Context Aggregation for Video Scene Graph Generation