Abstract: Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. Existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to an emotion. In this paper, we introduce a new task, Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) emotions have extremely varied temporal dynamics; 2) emotion cues are embedded in both appearances and complex plots; 3) fine-grained temporal annotations are complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves complex plot understanding by reasoning about the dependencies among the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos
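To make the idea of multi-granularity temporal contexts more concrete, the following is a minimal sketch and not the authors' network. It assumes each video segment is already represented by a feature vector, and it replaces learned dilated convolutions with a fixed average over neighbours at a given dilation. The function names `dilated_context` and `multi_granularity_contexts` and the dilation rates `(1, 2, 4)` are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def dilated_context(features: List[Vector], dilation: int) -> List[Vector]:
    """Average each segment with the neighbours `dilation` steps before and after it.

    Neighbours that fall outside the sequence are skipped, so segments near
    the borders average over fewer terms. This fixed average is a stand-in
    (assumption) for a learned dilated temporal convolution.
    """
    n = len(features)
    contexts: List[Vector] = []
    for t in range(n):
        # Indices of the centre segment and its two dilated neighbours.
        idxs = [i for i in (t - dilation, t, t + dilation) if 0 <= i < n]
        dim = len(features[t])
        contexts.append(
            [sum(features[i][k] for i in idxs) / len(idxs) for k in range(dim)]
        )
    return contexts


def multi_granularity_contexts(
    features: List[Vector], dilations: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Vector]]:
    """Compute one context sequence per dilation rate (hypothetical rates)."""
    return {d: dilated_context(features, d) for d in dilations}


# Toy input: five segments with one-dimensional features.
segments = [[0.0], [1.0], [2.0], [3.0], [4.0]]
contexts = multi_granularity_contexts(segments, dilations=(1, 2))
```

Small dilations summarize short-lived cues, while larger ones cover longer spans. A model along the lines described in the abstract would learn how to weight and combine these views rather than use fixed averages.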

Keeping in Time: Adding Temporal Context to Sentiment Analysis Models

Temporal Effects on Pre-trained Models for Language Processing Tasks

Towards Effective Time-Aware Language Representation: Exploring Enhanced Temporal Understanding in Language Models

Lifelong Text-Audio Sentiment Analysis Learning

Temporal Sentiment Localization: Listen and Look in Untrimmed Videos

Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Improving Event Temporal Relation Classification via Auxiliary Label-Aware Contrastive Learning

Temporal Context Consistency Above All: Enhancing Long-Term Anticipation by Learning and Enforcing Temporal Constraints

A Systematic Analysis on the Temporal Generalization of Language Models in Social Media

Efficient Continue Training of Temporal Language Model with Structural Information

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks

Temporal Validity Change Prediction

Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

Learning Abstract Snippet Detectors with Temporal Embedding in Convolutional Neural Networks

Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models

Label Attention Network for Temporal Sets Prediction: You Were Looking at a Wrong Self-Attention

NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

Temporal Enhancement for Video Affective Content Analysis

BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos