MCT-VHD: Multi-modal contrastive transformer for video highlight detection

Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang
DOI: https://doi.org/10.1016/j.jvcir.2024.104162
IF: 2.887
2024-04-29
Journal of Visual Communication and Image Representation
Abstract: Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text information that could provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes video and audio modalities, along with auxiliary text features (when available), for video highlight detection. Specifically, we enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.
computer science, information systems, software engineering