MTIDNet: A Multimodal Temporal Interest Detection Network for Video Summarization

Xiaoyan Tian, Ye Jin, Zhao Zhang, Peng Liu, Xianglong Tang
DOI: https://doi.org/10.1109/icassp48485.2024.10448236
2024-01-01
Abstract: Video summarization creates a succinct overview of a video by merging its most valuable parts. Existing methods treat the task as keyframe selection, using frame- and shot-level techniques with unimodal or bimodal information. Besides underestimating the inter-relations between different configurations of modality embedding spaces, these methods are limited in their ability to preserve semantic integrity within a single summary segment. To address these issues, we propose a novel multimodal temporal interest detection network (MTIDNet) that learns multimodal features in fine- and coarse-grained embedding spaces using a mutual cross fusion layer. Furthermore, we design a temporal interest detection network to predict the importance score and boundaries of each temporal segment, which carries local and global features across shots. Experimental results demonstrate the effectiveness of MTIDNet on the challenging SumMe and TVSum datasets.