Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Hai Yu, Chong Deng, Qinglin Zhang, Jiaqing Liu, Qian Chen, Wen Wang
2024-08-01
Abstract: The video topic segmentation (VTS) task segments videos into intelligible, non-overlapping topics, facilitating efficient comprehension of video content and quick access to specific content. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods using shallow features or unsupervised approaches struggle to accurately discern the nuances of topical transitions. Recently, supervised approaches have achieved superior performance on video action or scene segmentation over unsupervised approaches. In this work, we improve supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling. Specifically, (1) we enhance multimodal fusion by exploring different architectures using cross-attention and mixture of experts. (2) To generally strengthen multimodality alignment and fusion, we pre-train and fine-tune the model with multimodal contrastive learning. (3) We propose a new pre-training task tailored for the VTS task, and a novel fine-tuning task for enhancing multimodal coherence modeling for VTS. We evaluate the proposed approaches on educational videos, in the form of lectures, due to the vital role of topic segmentation of educational videos in boosting learning experiences. Additionally, we introduce a large-scale Chinese lecture video dataset to augment the existing English corpus, promoting further research in VTS. Experiments on both English and Chinese lecture datasets demonstrate that our model achieves superior VTS performance compared to competitive unsupervised and supervised baselines.
Artificial Intelligence, Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses video topic segmentation (VTS). The goal of VTS is to divide a video into coherent, non-overlapping topics, which supports efficient understanding of video content and quick access to specific content. VTS is also crucial for various downstream video understanding tasks. Traditional methods rely on shallow features or unsupervised approaches and struggle to identify the nuances of topic transitions. In recent years, supervised methods have outperformed unsupervised methods on video action and scene segmentation. This paper therefore improves supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling.

### Main contributions:

1. **A supervised multimodal sequence labeling model (MMVTS)**: The paper compares different multimodal fusion architectures and proposes a new self-supervised pre-training task and a new fine-tuning task to enhance multimodal coherence modeling.
2. **A large-scale Chinese lecture video dataset (CLVTS)**: The dataset complements the existing English corpus and promotes further research in VTS.
3. **New state-of-the-art (SOTA) results on both English and Chinese lecture video datasets**: The model surpasses competitive unsupervised and supervised baselines, and comprehensive ablation studies further confirm the effectiveness of the proposed methods.

### Method overview:

- **Multimodal fusion**: Strengthen the fusion of multimodal information by comparing fusion architectures based on cross-attention and on mixture of experts (MoE).
- **Pre-training and fine-tuning**: Use multimodal contrastive learning in both pre-training and fine-tuning to strengthen cross-modal alignment and fusion; propose a new pre-training task tailored to VTS and a new fine-tuning task to enhance multimodal coherence modeling.
- **Multimodal coherence modeling**: Improve the model's multimodal coherence by increasing the similarity of multimodal features within the same topic and the difference between multimodal features of different topics.

### Experimental results:

- **Unimodal performance**: Longformer, using the text modality, significantly outperforms BaSSL, using the visual modality, and the unsupervised method UnsupAVLS, indicating that the text modality provides more accurate and richer information for VTS.
- **Multimodal performance**: Compared with Longformer using only the text modality, multimodal methods perform better on some metrics. In particular, after pre-training and fine-tuning with enhanced multimodal coherence modeling, the MMVTS model with a Co-Attention and MoE fusion layer achieves the best average score, BS@30, and F1@30 on the AVLecture dataset, and obtains the best F1 and nearly the best average score on the CLVTS dataset.

### Conclusion:

This paper significantly improves supervised VTS performance by improving multimodal fusion and multimodal coherence modeling. The proposed MMVTS model performs well on both the English and Chinese lecture datasets, providing a new direction for future VTS research.
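
### Illustrative sketch:

The coherence idea in the method overview (raise the similarity of multimodal features within a topic, lower it across topics) can be illustrated with a small pairwise objective. The sketch below is not the paper's loss, which this summary does not specify. It assumes that each clip already has a fused feature vector and a topic label, and the names `cosine`, `coherence_loss`, and `margin` are hypothetical. Same-topic pairs are pulled toward cosine similarity 1, and different-topic pairs are penalized only when their similarity exceeds `margin`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coherence_loss(features, topic_ids, margin=0.0):
    """Hypothetical pairwise coherence objective (an assumption, not the paper's loss).

    features:  list of fused per-clip feature vectors (lists of floats)
    topic_ids: list of topic labels, one per clip
    margin:    similarity level above which different-topic pairs are penalized
    """
    pull_terms = []  # same-topic pairs: penalize low similarity
    push_terms = []  # different-topic pairs: penalize similarity above the margin
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(features[i], features[j])
            if topic_ids[i] == topic_ids[j]:
                pull_terms.append(1.0 - sim)
            else:
                push_terms.append(max(0.0, sim - margin))
    pull = sum(pull_terms) / len(pull_terms) if pull_terms else 0.0
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0
    return pull + push


# Toy usage: four clips with 2-D features.
clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = coherence_loss(clips, [0, 0, 1, 1])     # topics match the feature clusters
misaligned = coherence_loss(clips, [0, 1, 0, 1])  # topics cut across the clusters
print(aligned, misaligned)
```

In the toy usage, the labeling that matches the feature clusters yields a lower loss than the labeling that cuts across them, which is the behavior the coherence bullet describes. A trainable version would compute the same quantity on batched tensors in a deep learning framework.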