Abstract: Video visual relation inference is the task of automatically detecting relation triplets between the objects observed in videos, in the form $\langle subject, predicate, object \rangle$, which requires correctly labeling each detected object and the predicates that describe their interactions. Despite recent advances in image visual relation detection using deep learning techniques, relation inference in videos remains a challenging topic. On the one hand, the introduction of temporal information makes it necessary to model the rich spatio-temporal visual information of objects and videos. On the other hand, videos in the wild are often annotated with incomplete relation triplet tags, and a few of these tags overlap semantically. Previous methods, however, adopt hand-crafted visual features extracted from object trajectories, which describe only the local appearance characteristics of isolated objects. They also treat the problem as a multi-class classification task, which makes the relation tags mutually exclusive. To address these issues, we propose a novel model, termed Visual-Semantic Relation Network (VSRN). In this network, we leverage three-dimensional convolution kernels to capture spatio-temporal features, and encode global visual features of videos through a pooling operation on each time slice. Moreover, the semantic collocations between objects are incorporated to obtain comprehensive representations of the relationships. For relation classification, we treat the problem as a multi-label classification task and regard each tag as independent, so that multiple relationships can be predicted. Additionally, we modify the commonly used evaluation metric, video-wise recall, into a pair-wise metric (Roop) for testing the performance of models in predicting multiple relationships for object pairs. Extensive experimental results on two large-scale datasets demonstrate the effectiveness of our proposed model, which significantly outperforms previous works.
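The difference between the multi-class and multi-label treatments of relation tags described in the abstract can be illustrated with a minimal sketch. This is not the VSRN implementation: the tag names, logit values, and the 0.5 threshold are hypothetical, chosen only to show how a softmax forces one tag per object pair while independent sigmoids allow several.

```python
import math

def softmax(logits):
    """Normalize logits into a distribution; tags compete with each other."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Score one tag independently of all the others."""
    return 1.0 / (1.0 + math.exp(-x))

def multiclass_predict(logits, tags):
    """Multi-class view: exactly one (mutually exclusive) tag per object pair."""
    probs = softmax(logits)
    best = max(range(len(tags)), key=lambda i: probs[i])
    return [tags[best]]

def multilabel_predict(logits, tags, threshold=0.5):
    """Multi-label view: keep every tag whose independent score passes the threshold."""
    return [t for t, x in zip(tags, logits) if sigmoid(x) >= threshold]

# Hypothetical predicate tags and logits for a single subject-object pair.
TAGS = ["walk_behind", "follow", "taller", "sit_on"]
LOGITS = [2.1, 1.7, 0.4, -3.0]

print(multiclass_predict(LOGITS, TAGS))
print(multilabel_predict(LOGITS, TAGS))
```

With these illustrative values, the multi-class rule returns only the top-scoring tag, whereas the multi-label rule keeps the semantically overlapping tags as well, which is the behavior a pair-wise metric over multiple relationships is meant to measure.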

Visual Semantic Role Labeling for Video Understanding

Structured Label Inference for Visual Understanding

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Video Action Recognition with Attentive Semantic Units

Grounded Video Situation Recognition

ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

MovieCLIP: Visual Scene Recognition in Movies

Look Before you Speak: Visually Contextualized Utterances

Seeing the Unseen: Visual Common Sense for Semantic Placement

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

Video In Sentences Out

Learning Visual Actions Using Multiple Verb-Only Labels

Semantic learning network for controllable video captioning

VSRN: Visual-Semantic Relation Network for Video Visual Relation Inference

Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

Large-Scale Video Understanding with Limited Training Labels

Semantic Model Vectors for Complex Video Event Recognition