Abstract: Person search has long been treated as a crucial and challenging task that supports deeper insight into personalized summarization and personality discovery. Traditional methods, e.g., person re-identification and face recognition techniques, profile video characters based on visual information; they are often limited to relatively fixed poses or small viewpoint variations and struggle in more realistic scenes with high motion complexity (e.g., movies). At the same time, long videos such as movies often have logical story lines and are composed of continuously developing plots. In this situation, different persons usually meet on specific occasions, in which informative social cues are exhibited. We notice that these social cues could semantically profile their personality and benefit the person search task in two ways. First, persons with certain relationships usually co-occur within short intervals; if one of them is easier to identify, the social relation cues extracted from their co-occurrences could help identify the harder ones. Second, social relations could reveal the association between certain scenes and characters (e.g., a classmate relationship may only exist among students), which could narrow the candidates down to persons with a specific relationship. In this way, high-level social relation cues could improve the effectiveness of person search. Along this line, in this article, we propose a social context-aware framework that fuses visual and social contexts to profile persons from more semantic perspectives and to better handle the person search task in complex scenarios. Specifically, we first segment videos into several independent scene units and abstract out the social contexts within these scene units. Then, we construct inter-personal links through a graph formulation operation for each scene unit, in which both visual cues and relation cues are considered. Finally, we perform relation-aware label propagation to identify characters’ occurrences, combining low-level semantic cues (i.e., visual cues) and high-level semantic cues (i.e., relation cues) to further enhance accuracy. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.
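
The idea of combining visual cues and relation cues during label propagation can be illustrated with a small sketch. This is not the authors' implementation: it uses a generic clamped label propagation over one scene graph, and the function names (`fuse_affinity`, `propagate_labels`), the mixing weight `alpha`, and the toy affinity values are all assumptions made for illustration.

```python
# Hypothetical sketch: generic label propagation over one scene graph whose
# edge weights blend visual similarity with relation-cue affinity.
# All names and numbers here are illustrative assumptions.

def fuse_affinity(visual, relation, alpha=0.5):
    """Blend two symmetric n x n affinity matrices; alpha weights the visual cue."""
    n = len(visual)
    return [
        [
            alpha * visual[i][j] + (1 - alpha) * relation[i][j] if i != j else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate_labels(weights, seeds, num_classes, iters=50):
    """Spread identity scores from seed nodes to unlabeled nodes.

    weights: n x n non-negative edge weights.
    seeds: dict mapping node index -> known identity (kept fixed every round).
    Returns the most likely identity index for every node.
    """
    n = len(weights)
    # Unlabeled nodes start uniform; seed nodes start (and stay) one-hot.
    scores = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for node, label in seeds.items():
        scores[node] = [1.0 if c == label else 0.0 for c in range(num_classes)]

    for _ in range(iters):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seeds or total == 0:
                updated.append(scores[i])  # clamp seeds, skip isolated nodes
                continue
            # New score is the weighted average of the neighbours' scores.
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(num_classes)
            ])
        scores = updated

    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Toy scene: nodes 0 and 1 are confidently identified (identities 0 and 1).
# Node 2 is visually ambiguous, but its relation cue ties it to node 0.
visual = [
    [0.0, 0.1, 0.5, 0.1],
    [0.1, 0.0, 0.5, 0.8],
    [0.5, 0.5, 0.0, 0.2],
    [0.1, 0.8, 0.2, 0.0],
]
relation = [
    [0.0, 0.2, 0.9, 0.1],
    [0.2, 0.0, 0.1, 0.7],
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.7, 0.1, 0.0],
]
seeds = {0: 0, 1: 1}

visual_only = propagate_labels(fuse_affinity(visual, relation, alpha=1.0), seeds, 2)
fused = propagate_labels(fuse_affinity(visual, relation, alpha=0.5), seeds, 2)
```

In this toy setup, the visually ambiguous node 2 is assigned to identity 1 when only visual affinity is used and to identity 0 once the relation cue is blended in, which mirrors the abstract's point that an easily identified person can help identify a harder one through their relationship.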

Social Context-aware Person Search in Videos via Multi-modal Cues
