Abstract: Automatically interpreting social relations (e.g., friendship and kinship) from visual scenes has considerable application value in areas such as knowledge graph construction, analysis of personal behavior and emotion, and entertainment. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and assign each video to one of a set of predefined types. Such methods cannot recognize multiple relations in multi-person videos, which does not match real application scenarios. Videos are also inherently multimodal: subtitles provide abundant cues for relationship recognition, yet these cues are often ignored by researchers. In this paper, we introduce and define a new task named "Multiple-Relation Extraction in Videos (MREV)". To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework for jointly modeling visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the distinctive cues of face, body, and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. We then use a Multi-Conv Attention module to capture inter-modal correlations and map the features of the different modalities into a coordinated feature space. In this way, our VTF framework comprehensively exploits multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on both datasets demonstrate the effectiveness of the proposed framework.
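
The abstract does not specify how the Temporal Feature Aggregation module scores frames. As a rough illustration of the general idea of quality-aware temporal pooling, the sketch below weights per-frame feature vectors by a softmax over scalar quality scores. The function names, the linear scoring vector, and the softmax weighting are all assumptions made for illustration. They are not the authors' implementation.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(
    frame_features: List[List[float]],
    scoring_vector: List[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-frame features into one clip-level vector.

    Assumption: frame quality is a dot product between the frame feature
    and a scoring vector (hypothetical; it would be learned in practice).
    Returns the pooled feature and the per-frame weights.
    """
    # One scalar quality score per frame.
    scores = [
        sum(f * w for f, w in zip(feat, scoring_vector))
        for feat in frame_features
    ]
    # Normalize scores into weights that sum to 1.
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Weighted sum of frame features, computed dimension by dimension.
    pooled = [
        sum(wt * feat[d] for wt, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [3.0, 2.0]]
    pooled, weights = aggregate_frames(frames, scoring_vector=[0.5, 0.5])
    print(pooled, weights)
```

With an all-zero scoring vector every frame receives the same weight, so the pooling reduces to a plain average over frames. A non-zero scoring vector shifts weight toward the frames that score higher.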

Deep Relationship Analysis in Video with Multimodal Feature Fusion

Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Emotion Recognition in Videos via Fusing Multimodal Features

Hybrid Improvements in Multimodal Analysis for Deep Video Understanding

Video Visual Relation Detection Via Multi-modal Feature Fusion

Online video visual relation detection with hierarchical multi-modal fusion

A Multimodal Approach for Multiple-Relation Extraction in Videos

Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning

A multi-modal fusion approach for measuring web video relatedness

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding

Multimodal Deep Representation Learning for Video Classification

TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction

Multimodal feature fusion based on object relation for video captioning

Socializing the Videos: A Multimodal Approach for Social Relation Recognition

Deep Multimodal Feature Encoding for Video Ordering

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism