OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

Ye Liu, Lingfeng Qiao, Di Yin, Zhuoxuan Jiang, Xinghua Jiang, Deqiang Jiang, Bo Ren
DOI: https://doi.org/10.48550/arXiv.2207.01241
2022-07-04
Abstract: Scene segmentation and classification (SSC) serve as a critical step towards the field of video structuring analysis. Intuitively, jointly learning of these two tasks can promote each other by sharing common information. However, scene segmentation concerns more on the local difference between adjacent shots while classification needs the global representation of scene segments, which probably leads to the model dominated by one of the two tasks in the training phase. In this paper, from an alternate perspective to overcome the above challenges, we unite these two tasks into one task by a new form of predicting shots link: a link connects two adjacent shots, indicating that they belong to the same scene or category. To the end, we propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) to both distinguish and leverage the two-fold semantics by reforming the two learning tasks into a unified one. Furthermore, we tailor a specific module called DiffCorrNet to explicitly extract the information of differences and correlations among shots. Extensive experiments on a brand-new large scale dataset collected from real-world applications, and MovieScenes are conducted. Both the results demonstrate the effectiveness of our proposed method against strong baselines.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to perform scene segmentation and classification (SSC) jointly in video structuring analysis. The two tasks can, in principle, help each other by sharing information. However, scene segmentation depends mainly on local differences between adjacent shots, while classification needs a global representation of the whole scene segment, so a jointly trained model risks being dominated by one of the two tasks. To address this, the authors unify the two tasks into a single one: predicting shot links, where a link connects two adjacent shots and indicates that they belong to the same scene or category. This formulation lets the model both distinguish and leverage the two kinds of semantics, with the aim of improving segmentation and classification together.
The main contributions of the paper are:
1. A new formulation of scene segmentation and classification as shot-link prediction, together with a One Stage Multimodal Sequential Link Framework (OS-MSL) that unifies the tasks into one stage.
2. A dedicated module, DiffCorrNet, that explicitly extracts both the differences and the correlations among shots.
3. A new large-scale dataset, TI-News, which contains hundreds of news videos with segmentation and category labels.
4. Extensive experiments on two datasets, TI-News and MovieScenes, which demonstrate the effectiveness of the proposed method and achieve state-of-the-art results.
Through these contributions, the paper offers a general framework for video sequence modeling problems such as action, activity or event recognition, all of which can be cast as finding boundaries between consecutive segments while incorporating auxiliary information such as labels, key frames and subtitles.
In addition, the newly constructed dataset TI-News enriches the data resources in the field of scene understanding and promotes research and applications in this field.
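To make the shot-link idea concrete, the sketch below decodes a sequence of per-link labels into scene segments with categories. It is only an illustration under assumed conventions, not the authors' implementation: the summary above does not specify how links are encoded, so the label scheme (`0` for a scene boundary, a positive integer `k` for "same scene, category `k`"), the majority vote, and the names `BOUNDARY`, `decode_links` and `_majority` are all hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical encoding (an assumption, not taken from the paper):
# 0 means the two adjacent shots belong to different scenes;
# a positive integer k means they share a scene of category k.
BOUNDARY = 0


def _majority(votes: List[int]) -> int:
    """Return the most frequent category, or BOUNDARY if there are no votes."""
    if not votes:
        return BOUNDARY
    return Counter(votes).most_common(1)[0][0]


def decode_links(links: List[int]) -> List[Tuple[int, int, int]]:
    """Convert link labels into (first_shot, last_shot, category) segments.

    links[i] describes the pair (shot i, shot i + 1), so a video with
    n shots has n - 1 links. A single-shot scene has no internal link,
    so under this toy encoding its category is reported as BOUNDARY.
    """
    segments = []
    start = 0            # index of the first shot in the current scene
    votes: List[int] = []  # category labels seen on links inside the scene
    for i, label in enumerate(links):
        if label == BOUNDARY:
            # Shot i closes the current scene; shot i + 1 opens the next one.
            segments.append((start, i, _majority(votes)))
            start = i + 1
            votes = []
        else:
            votes.append(label)
    # The last scene always ends at the final shot, whose index is len(links).
    segments.append((start, len(links), _majority(votes)))
    return segments


if __name__ == "__main__":
    # Nine shots, eight links: three scenes with categories 2, 1 and 3.
    print(decode_links([2, 2, 0, 1, 1, 1, 0, 3]))
```

With this encoding, one label sequence carries both the boundaries (where a `0` occurs) and the categories (the non-zero values), which is the sense in which a single prediction task can cover segmentation and classification at once.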