Temporal Enhancement for Video Affective Content Analysis

Xin Li, Shangfei Wang, Xuandong Huang
DOI: https://doi.org/10.1145/3664647.3681631
2024-01-01
Abstract: With the growing popularity of the Internet and video-sharing platforms, video affective content analysis has advanced considerably. Temporal information is crucial for this task. Nevertheless, existing methods often overlook two facts: videos contain substantial irrelevant information, and modalities differ in their importance for emotional tasks. This can introduce noise from both temporal fragments and modalities, reducing the model's ability to identify crucial temporal fragments and recognize emotions. To tackle these issues, we propose a Temporal Enhancement (TE) method. Specifically, we use three encoders to extract features at various levels and employ temporal sampling to enhance the temporal data, thereby enriching the video representation and improving the model's robustness to noise. Subsequently, we design a cross-modal temporal enhancement module to enhance the temporal information of each modality's features. This module interacts with multiple modalities simultaneously to emphasize critical temporal fragments while suppressing irrelevant ones. Experimental results on four benchmark datasets show that the proposed temporal enhancement method achieves state-of-the-art performance in video affective content analysis. Moreover, ablation experiments confirm the effectiveness of each module.
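The abstract names two mechanisms, temporal sampling and cross-modal temporal enhancement, without giving their details. The following is a minimal illustrative sketch of the general idea only, not the authors' implementation. Everything in it is an assumption made for illustration: the segment-based sampling scheme, the dot-product agreement score, the softmax over time steps, and all function names (`temporal_sample`, `cross_modal_temporal_weights`, `enhance`).

```python
import math
import random


def temporal_sample(frames, num_segments, rng):
    """Pick one random frame from each of num_segments equal spans.

    Assumption: segment-based sampling stands in for the paper's
    unspecified temporal sampling strategy.
    """
    n = len(frames)
    if num_segments < 1 or n < num_segments:
        raise ValueError("need 1 <= num_segments <= len(frames)")
    # Integer bounds are strictly increasing because n >= num_segments.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [frames[rng.randrange(start, end)]
            for start, end in zip(bounds, bounds[1:])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_temporal_weights(target, others):
    """Weight each time step of `target` by its agreement with other modalities.

    target: list of T feature vectors for one modality.
    others: list of modalities, each a list of T vectors of the same size.
    Assumption: the mean dot product across modalities is a hypothetical
    stand-in for a learned cross-modal interaction.
    """
    scores = []
    for t, vec in enumerate(target):
        sims = [sum(a * b for a, b in zip(vec, other[t])) for other in others]
        scores.append(sum(sims) / len(sims))
    return softmax(scores)


def enhance(target, others):
    """Rescale each time step so high-weight fragments are emphasized."""
    weights = cross_modal_temporal_weights(target, others)
    steps = len(target)
    # Multiplying by the number of steps keeps uniform weights a no-op.
    return [[w * steps * x for x in vec] for w, vec in zip(weights, target)]


# Toy features: 3 time steps, 2 dimensions per modality.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
weights = cross_modal_temporal_weights(visual, [audio])
enhanced_visual = enhance(visual, [audio])
sampled = temporal_sample(list(range(10)), 5, random.Random(0))
```

In this toy example the third time step, where the two modalities agree most strongly, receives the largest weight, and the second, where they do not agree at all, receives the smallest. A real model would learn these interactions from data rather than use fixed dot products.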