Abstract: The video topic segmentation (VTS) task segments videos into intelligible, non-overlapping topics, facilitating efficient comprehension of video content and quick access to specific content. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods using shallow features or unsupervised approaches struggle to accurately discern the nuances of topical transitions. Recently, supervised approaches have achieved superior performance on video action or scene segmentation over unsupervised approaches. In this work, we improve supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling. Specifically, (1) we enhance multimodal fusion by exploring different architectures using cross-attention and mixture of experts. (2) To generally strengthen multimodality alignment and fusion, we pre-train and fine-tune the model with multimodal contrastive learning. (3) We propose a new pre-training task tailored for the VTS task, and a novel fine-tuning task for enhancing multimodal coherence modeling for VTS. We evaluate the proposed approaches on educational videos, in the form of lectures, due to the vital role of topic segmentation of educational videos in boosting learning experiences. Additionally, we introduce a large-scale Chinese lecture video dataset to augment the existing English corpus, promoting further research in VTS. Experiments on both English and Chinese lecture datasets demonstrate that our model achieves superior VTS performance compared to competitive unsupervised and supervised baselines.

What problem does this paper attempt to address?

This paper addresses video topic segmentation (VTS). The goal of VTS is to divide a video into coherent, non-overlapping topics, which makes video content easier to understand and specific content quicker to find. VTS is also important for various downstream video understanding tasks. Traditional methods rely on shallow features or unsupervised approaches and struggle to identify the nuances of topic transitions. Supervised methods have recently outperformed unsupervised ones on video action and scene segmentation. This paper therefore improves supervised VTS by exploring multimodal fusion and multimodal coherence modeling in depth.

### Main contributions:

1. **A supervised multimodal sequence labeling model (MMVTS)**: The paper compares different multimodal fusion architectures and proposes new self-supervised pre-training and fine-tuning tasks to enhance multimodal coherence modeling.
2. **A large-scale Chinese lecture video dataset (CLVTS)**: The dataset is introduced to promote further research in VTS.
3. **New state-of-the-art (SOTA) results on both English and Chinese lecture video datasets**: The model surpasses competitive unsupervised and supervised baselines, and comprehensive ablation studies further confirm the effectiveness of the proposed methods.

### Method overview:

- **Multimodal fusion**: Fusion of multimodal information is enhanced by comparing fusion architectures based on cross-attention and on mixture of experts (MoE).
- **Pre-training and fine-tuning**: Multimodal contrastive learning is used to strengthen cross-modal alignment, and new pre-training and fine-tuning tasks are proposed to enhance multimodal coherence modeling.
- **Multimodal coherence modeling**: Coherence is improved by increasing the similarity of multimodal features within the same topic and the difference between multimodal features of different topics.

### Experimental results:

- **Unimodal performance**: LongFormer, using the text modality, significantly outperforms both BaSSL, using the visual modality, and the unsupervised method UnsupAVLS, indicating that the text modality provides more accurate and richer information for VTS.
- **Multimodal performance**: Compared with LongFormer using only the text modality, multimodal methods perform better on some metrics. In particular, after pre-training and fine-tuning with enhanced multimodal coherence modeling, the MMVTS model with a Co-Attention and MoE fusion layer achieves the best average score, BS@30, and F1@30 on the AVLecture dataset, and obtains the best F1 and nearly the best average score on the CLVTS dataset.

### Conclusion:

This paper improves supervised VTS through better multimodal fusion and multimodal coherence modeling. The proposed MMVTS model achieves the best or near-best results on both the AVLecture and CLVTS datasets and offers a direction for future VTS research.
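
The coherence idea from the method overview (higher similarity between multimodal features within a topic, lower similarity across topics) can be illustrated with a small sketch. This is not the paper's loss: the function name `coherence_loss`, the cosine-similarity formulation, and the `margin` parameter are assumptions made here for illustration, and the inputs stand in for fused per-clip features that a real model would produce.

```python
import numpy as np


def coherence_loss(features, topic_ids, margin=0.0):
    """Toy coherence objective (illustrative assumption, not the paper's loss).

    features:  (n_clips, dim) array of fused multimodal clip features.
    topic_ids: (n_clips,) integer topic label for each clip.
    margin:    cross-topic similarity above this value is penalized.
    """
    features = np.asarray(features, dtype=float)
    topic_ids = np.asarray(topic_ids)

    # Cosine similarity between every pair of clips.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    same_topic = topic_ids[:, None] == topic_ids[None, :]
    not_self = ~np.eye(len(topic_ids), dtype=bool)
    pos = same_topic & not_self   # pairs of distinct clips in the same topic
    neg = ~same_topic             # pairs of clips from different topics

    # Pull same-topic clips together, push different-topic clips apart.
    pull = (1.0 - sim[pos]).mean() if pos.any() else 0.0
    push = np.maximum(0.0, sim[neg] - margin).mean() if neg.any() else 0.0
    return float(pull + push)


if __name__ == "__main__":
    topics = [0, 0, 1, 1]
    coherent = [[1, 0], [1, 0], [0, 1], [0, 1]]   # features follow topics
    scrambled = [[1, 0], [0, 1], [1, 0], [0, 1]]  # features ignore topics
    print(coherence_loss(coherent, topics))
    print(coherence_loss(scrambled, topics))
```

Under these assumptions, features that follow the topic structure yield a lower loss than features that ignore it, which is the behavior the coherence objective is meant to encourage.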

Multimodal Fusion and Coherence Modeling for Video Topic Segmentation
