Abstract: Automatically interpreting social relations (e.g., friendship and kinship) from visual scenes has great potential value in applications such as knowledge graph construction, analysis of human behavior and emotion, and the entertainment ecosystem. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and assign each video to a single predefined type. Such methods cannot recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. Moreover, videos are inherently multimodal: subtitles also provide abundant cues for relationship recognition, yet these cues are often ignored by researchers. In this paper, we introduce and define a new task named "Multiple-Relation Extraction in Videos (MREV)". To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework, which jointly models visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the distinctive cues of the face, the body, and the dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. We then use a Multi-Conv Attention module to capture inter-modal correlations and map the features of the different modalities into a coordinated feature space. In this way, our VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on the VMR and ViSR datasets demonstrate the effectiveness of the proposed framework.

A Multimodal Approach for Multiple-Relation Extraction in Videos
