Abstract: Automatically interpreting social relations (e.g., friendship, kinship) from visual scenes has great potential value in applications such as knowledge graph construction, person behavior and emotion analysis, and entertainment ecology. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and simply categorize videos into predefined types. Such methods are unable to recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. At the same time, videos are inherently multimodal: subtitles in a video also provide abundant cues for relationship recognition, yet these cues are often ignored by researchers. In this paper, we introduce and define a new task named "Multiple-Relation Extraction in Videos (MREV)". To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework for jointly modeling visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the unique cues of face, body, and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module to perform temporal reasoning, which adaptively assesses the quality of different frames. After that, we use a Multi-Conv Attention module to capture the inter-modal correlation and map the features of different modalities to a coordinated feature space. In this way, our VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on the VMR and ViSR datasets demonstrate the effectiveness of the proposed framework.
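
The abstract does not specify how the Temporal Feature Aggregation module weights frames. As a rough illustration of the general idea of adaptive, quality-based frame weighting, the sketch below scores each frame feature, turns the scores into softmax weights, and pools the frames by a weighted sum. It is not the paper's implementation: the function name `aggregate_frames`, the linear scoring vector `score_vec`, and the feature sizes are all assumptions made for illustration.

```python
import numpy as np

def aggregate_frames(frame_feats, score_vec):
    """Quality-weighted pooling of per-frame features (illustrative only).

    frame_feats: (T, D) array, one feature vector per frame.
    score_vec:   (D,) hypothetical scoring vector; in a real model this
                 would be learned, here it is just a placeholder.
    Returns the pooled (D,) feature and the (T,) frame weights.
    """
    scores = frame_feats @ score_vec           # (T,) raw quality score per frame
    scores = scores - scores.max()             # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ frame_feats             # (D,) weighted sum over frames
    return pooled, weights

# Toy usage with random features: 8 frames, 16-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
score_vec = rng.normal(size=16)
pooled, weights = aggregate_frames(feats, score_vec)
```

With an all-zero scoring vector every frame receives the same weight and the result reduces to plain average pooling, which makes the role of the quality scores easy to see.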

Social Relation Recognition from Videos Via Multi-Scale Spatial-Temporal Reasoning

Social Relation Analysis from Videos Via Multi-entity Reasoning

Learning Social Spatio-Temporal Relation Graph in the Wild and a Video Benchmark

Multi-Granularity Reasoning for Social Relation Recognition From Images

Multi-stream Fusion Model for Social Relation Recognition from Videos

Recognizing Social Relationships in Long Videos Via Multimodal Character Interaction

Video Relation Detection with Spatio-Temporal Graph

Socializing the Videos: A Multimodal Approach for Social Relation Recognition

Shifted GCN-GAT and Cumulative-Transformer Based Social Relation Recognition for Long Videos

A Multimodal Approach for Multiple-Relation Extraction in Videos

SRE-Net Model for Automatic Social Relation Extraction from Video

Spatial–Temporal Relation Reasoning for Action Prediction in Videos

Progressive Graph Reasoning-Based Social Relation Recognition

Recognizing Characters and Relationships from Videos Via Spatial-Temporal and Multimodal Cues

Multi-Level Transformer-Based Social Relation Recognition

Spatio-Temporal Triangular-Chain CRF for Activity Recognition

Attentive Sequences Recurrent Network for Social Relation Recognition from Video

Principal Relation Component Reasoning-Enhanced Social Relation Recognition

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

Toward jointly understanding social relationships and characters from videos

Relation Extraction from Videos Based on IoT Intelligent Collaboration Framework