CMFF_VS: A Video Summarization Extraction Model Based on Cross-modal Feature Fusion

Chaoqun Xin, Mingyang Wang, Xianhao Zhao
DOI: https://doi.org/10.21203/rs.3.rs-5004063/v1
2024-01-01
Abstract: Video summarization aims to present the most relevant and important information in a video stream in the form of a summary. Most existing research focuses on the keyframe selection process, determining the importance of video frames from the dependencies between them. However, these works overlook the feature extraction process for video frames. In fact, rich and reliable frame features are an important basis for selecting video frames correctly. This article proposes CMFF_VS, a cross-modal video summarization extraction model that extracts deep semantic information from video frames. The CMFF_VS model uses the mutual enhancement of the video and text modalities to extract richer semantic information from video frames, thereby providing the features needed for the subsequent frame selection process. To solve the alignment problem between the semantic information of the two modalities, CMFF_VS introduces a cross-modal attention mechanism, which uses the semantic correlation between modalities to achieve cross-modal semantic fusion. At the same time, CMFF_VS introduces the ASPP module to extract and fuse multi-scale semantic features of each individual modality, enriching the high-level semantic information captured for each modality. The experimental results show that, compared with state-of-the-art unimodal and multimodal video summarization models, CMFF_VS achieves the best performance, indicating that the cross-modal deep feature extraction and fusion strategy proposed in CMFF_VS is reasonable and effective.
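As a rough illustration of the cross-modal attention idea mentioned in the abstract, the sketch below lets each video frame feature attend over a set of text features using plain scaled dot-product attention. It is not the CMFF_VS architecture: the function names, the absence of learned projection matrices, the residual addition, and the toy feature vectors are all assumptions made only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(frame_feats, text_feats):
    """Attend from each video frame (query) over text features (keys and values).

    frame_feats: list of T vectors, each of dimension d
    text_feats:  list of N vectors, each of dimension d
    Returns (fused, weights), where fused is T x d and weights is T x N.
    Hypothetical simplification: no learned query/key/value projections.
    """
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in frame_feats:
        # Similarity between this frame and every text feature.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in text_feats]
        attn = softmax(scores)
        # Weighted sum of text features: the text context for this frame.
        context = [sum(w * value[j] for w, value in zip(attn, text_feats))
                   for j in range(dim)]
        # Assumed fusion step: add the text context back onto the frame feature.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs (made up): two frames and three text features in a 2-d space.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(frames, text)
```

In this toy setup, each frame places more attention weight on the text features that point in the same direction as it does, so the fused frame feature is pulled toward semantically related text.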