Abstract: Automatically interpreting social relations, e.g., friendship and kinship, from visual scenes has considerable application potential in areas such as knowledge graph construction, human behavior and emotion analysis, and the entertainment ecosystem. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and categorize videos only into predefined types. Such methods cannot recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. At the same time, videos are inherently multimodal: subtitles also provide abundant cues for relationship recognition, yet these cues are often ignored by researchers. In this paper, we introduce and define a new task named "Multiple-Relation Extraction in Videos (MREV)". To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework, which jointly models visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the distinctive cues of face, body, and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. We then use a Multi-Conv Attention module to capture the inter-modal correlation and map the features of the different modalities into a coordinated feature space. In this way, our VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on the VMR and ViSR datasets demonstrate the effectiveness of the proposed framework.
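The idea of adaptively assessing frame quality during temporal aggregation can be illustrated with a minimal sketch. The abstract does not specify how the Temporal Feature Aggregation module is built, so everything here is an assumption made for illustration: the softmax weighting over per-frame scores, the function name `aggregate_frames`, and the `quality_weights` vector (a stand-in for a learned scoring layer) are hypothetical and are not taken from the paper.

```python
import math


def aggregate_frames(frame_features, quality_weights):
    """Pool T per-frame feature vectors into one clip-level vector.

    frame_features: list of T vectors, each a list of D floats.
    quality_weights: list of D floats; a hypothetical stand-in for a
        learned scoring layer (an assumption, not the paper's design).
    Returns (pooled_vector, attention_weights).
    """
    # One scalar quality score per frame: dot product with the scoring vector.
    scores = [
        sum(w * x for w, x in zip(quality_weights, frame))
        for frame in frame_features
    ]

    # Softmax over time, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Weighted sum over frames: higher-scoring frames contribute more.
    dim = len(frame_features[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(attention, frame_features))
        for d in range(dim)
    ]
    return pooled, attention
```

With an all-zero scoring vector every frame receives the same weight and the result reduces to plain average pooling; a non-zero scoring vector shifts the weight toward frames with higher scores.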