Abstract: Automatically interpreting social relations, such as friendship and kinship, from visual scenes has considerable potential value in areas such as knowledge graph construction, person behavior and emotion analysis, and the entertainment ecosystem. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and categorize videos into only predefined types. Such methods cannot recognize multiple relations in multi-person videos, which is inconsistent with real application scenarios. At the same time, videos are inherently multimodal: subtitles also provide abundant cues for relationship recognition, yet these cues are often ignored by researchers. In this paper, we introduce and define a new task named "Multiple-Relation Extraction in Videos (MREV)". To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework for jointly modeling visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the unique cues of face, body, and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module that performs temporal reasoning by adaptively assessing the quality of different frames. We then use a Multi-Conv Attention module to capture inter-modal correlations and map the features of the different modalities into a coordinated feature space. In this way, our VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on the VMR and ViSR datasets demonstrate the effectiveness of the proposed framework.
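
The idea of weighting frames by their assessed quality during temporal aggregation can be illustrated with a minimal sketch. This is not the authors' Temporal Feature Aggregation module: the function name `aggregate_frames`, the scalar per-frame quality scores, and the softmax weighting are all assumptions made for illustration. In a trained model the scores would presumably be predicted by a learned component; here they are passed in directly.

```python
import math


def aggregate_frames(frame_features, quality_scores):
    """Pool per-frame feature vectors into a single clip-level vector.

    Assumption (not from the paper): each frame has one scalar quality
    score, and the scores are converted into weights with a softmax, so
    frames with higher scores contribute more to the pooled vector.
    """
    if not frame_features or len(frame_features) != len(quality_scores):
        raise ValueError("need one quality score per frame")

    # Numerically stable softmax over the frame axis.
    peak = max(quality_scores)
    exps = [math.exp(score - peak) for score in quality_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frame vectors, one feature dimension at a time.
    dim = len(frame_features[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Two toy frames with two-dimensional features (hypothetical values).
    frames = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = aggregate_frames(frames, [2.0, 0.5])
    print(pooled, weights)
```

With equal scores the sketch reduces to plain average pooling; as one score grows relative to the others, the pooled vector approaches that single frame's features.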

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Knowledge Representation Learning with Entity Descriptions, Hierarchical Types, and Textual Relations

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Relation Extraction with Knowledge-Enhanced Prompt-Tuning on Multimodal Knowledge Graph

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Towards Bridged Vision and Language: Learning Cross-modal Knowledge Representation for Relation Extraction

A Multimodal Approach for Multiple-Relation Extraction in Videos

Using Augmented Small Multimodal Models to Guide Large Language Models for Multimodal Relation Extraction

Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Named Entity and Relation Extraction with Multi-Modal Retrieval

Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling

Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging

TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction

Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction

A Hierarchical Network for Multimodal Document-Level Relation Extraction

CGI-MRE: A Comprehensive Genetic-Inspired Model For Multimodal Relation Extraction