Abstract: Scene segmentation and classification (SSC) serve as a critical step towards video structuring analysis. Intuitively, jointly learning these two tasks can promote each other by sharing common information. However, scene segmentation is concerned more with the local differences between adjacent shots, while classification needs a global representation of scene segments, which is likely to leave the model dominated by one of the two tasks during training. In this paper, to overcome these challenges from an alternative perspective, we unite the two tasks into one through a new form of shot link prediction: a link connects two adjacent shots, indicating that they belong to the same scene or category. To this end, we propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) that both distinguishes and leverages the two-fold semantics by reformulating the two learning tasks into a unified one. Furthermore, we tailor a specific module called DiffCorrNet to explicitly extract the differences and correlations among shots. Extensive experiments are conducted on a brand-new large-scale dataset collected from real-world applications and on MovieScenes. The results on both datasets demonstrate the effectiveness of our proposed method against strong baselines.

What problem does this paper attempt to address?

This paper addresses how to perform scene segmentation and classification (SSC) simultaneously in video structuring analysis. The two tasks can promote each other by sharing information, but scene segmentation focuses on local differences between adjacent shots, while classification requires a global representation of the scene segment. As a result, the model may be dominated by one of the tasks during training. To address this challenge, the authors unify the two tasks in the form of shot link prediction. The method both distinguishes the differences between shots and makes use of that difference information, improving scene segmentation and classification alike.

The main contributions of the paper include:

1. Proposing a new perspective on scene segmentation and classification that recasts them as shot link prediction, together with a one-stage multimodal sequential link framework (OS-MSL) that unifies multiple tasks into one stage by predicting shot links.
2. Designing a dedicated network (DiffCorrNet) that simultaneously extracts the differences and correlations between adjacent shots.
3. Constructing a new large-scale dataset, TI-News, which contains hundreds of news videos with segmentation and category labels.
4. Running extensive experiments on two datasets, TI-News and MovieScenes, which demonstrate the effectiveness of the proposed method and achieve state-of-the-art results.

Through these contributions, the paper provides a general framework for video sequence modeling problems such as action, activity, or event recognition, all of which can be cast as finding boundaries between two consecutive segments while combining auxiliary information such as labels, key frames, and subtitles.
In addition, the newly constructed TI-News dataset enriches the data resources available for scene understanding and promotes research and applications in this field.
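
The idea of folding segmentation and classification into a single link label per pair of adjacent shots can be illustrated with a small sketch. The summary above does not spell out the exact label scheme used by OS-MSL, so everything here is an assumption made for illustration: the function names `encode_links` and `decode_links`, the convention that label 0 means "no link" (a scene boundary), and the convention that a positive label carries the shared scene category.

```python
from typing import List, Optional, Tuple

BOUNDARY = 0  # assumed label: the two adjacent shots are not linked


def encode_links(scene_ids: List[int], categories: List[int]) -> List[int]:
    """Build one link label per adjacent shot pair.

    Assumed scheme (not taken from the paper): a pair inside the same
    scene gets that scene's category id (>= 1); a pair that straddles a
    scene boundary gets BOUNDARY.
    """
    if len(scene_ids) != len(categories):
        raise ValueError("scene_ids and categories must have equal length")
    links = []
    for i in range(len(scene_ids) - 1):
        same_scene = scene_ids[i] == scene_ids[i + 1]
        links.append(categories[i] if same_scene else BOUNDARY)
    return links


def decode_links(links: List[int]) -> List[Tuple[int, int, Optional[int]]]:
    """Recover (first_shot, last_shot, category) segments from link labels.

    A single-shot scene has no internal link, so its category cannot be
    recovered under this scheme and is reported as None.
    """
    n_shots = len(links) + 1
    segments = []
    start = 0
    for i, link in enumerate(links):
        if link == BOUNDARY:
            # The scene ends at shot i; read its category from the last
            # link inside the scene, if the scene has more than one shot.
            category = links[i - 1] if i > start else None
            segments.append((start, i, category))
            start = i + 1
    last = n_shots - 1
    category = links[-1] if last > start else None
    segments.append((start, last, category))
    return segments


if __name__ == "__main__":
    scene_ids = [0, 0, 0, 1, 1, 2]
    categories = [2, 2, 2, 1, 1, 3]
    links = encode_links(scene_ids, categories)
    print(links)
    print(decode_links(links))
```

Under this assumed scheme a single sequence of link labels carries both the boundaries and the categories, which is the sense in which the two tasks become one prediction problem. The single-shot case shows one limitation of such a naive encoding; the actual framework may handle it differently.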

OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

Contrasting Multi-Modal Similarity Framework for Video Scene Segmentation

Modality-Aware Shot Relating and Comparing for Video Scene Detection

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

Learning a Contextual Multi-Thread Model for Movie/TV Scene Segmentation

Scene Consistency Representation Learning for Video Scene Segmentation

SGMNet: Scene Graph Matching Network for Few-Shot Remote Sensing Scene Classification

Semi-Supervised Multitask Learning for Scene Recognition

Multistage Scene-Level Constraints for Large-Scale Point Cloud Weakly Supervised Semantic Segmentation

Object Segmentation by Mining Cross-Modal Semantics

A Unified Framework for 3D Scene Understanding

A Two-Pipeline Instance Segmentation Network via Boundary Enhancement for Scene Understanding

CSMB-VSS: video scene segmentation with cosine similarity matrix

Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection

S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving

2D Semantic-Guided Semantic Scene Completion

MUCH: Mutual Coupling Enhancement of Scene Recognition and Dense Captioning

LinkNet: 2D-3D Linked Multi-Modal Network for Online Semantic Segmentation of RGB-D Videos

Online Scene Semantic Understanding Based on Sparsely Correlated Network for AR

M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation