Abstract: Moving object segmentation (MOS) is an important and well-studied computer vision task used in a variety of applications, such as video surveillance systems, human tracking, self-driving cars, and video compression. While traditional approaches to MOS rely on hand-crafted features or background modeling, deep learning methods using Convolutional Neural Networks (CNNs) have been shown to extract features more effectively and achieve better accuracy. However, most deep learning-based methods for MOS offer scene-dependent solutions, leading to reduced performance when tested on previously unseen video content. Because spatial features alone are insufficient to represent motion information, spatial and temporal features should be used together to perform well on unseen videos. To address this issue, we propose the MOS-Net deep framework, an encoder-decoder network that combines spatial and temporal features using the flux tensor algorithm, 3D CNNs, and ConvLSTM in its different variants. MOS-Net 2.0 is an enhanced version of the base MOS-Net structure, in which additional ConvLSTM modules are added to the 3D CNNs to extract long-term spatiotemporal features. In the final stage of the framework, the output of the encoder-decoder network, the foreground probability map, is thresholded to produce a binary mask in which moving objects form the foreground and everything else forms the background. In addition, an ablation study has been conducted to evaluate different combinations of inputs to the proposed network, using the ChangeDetection2014 (CDnet2014) dataset, which includes challenging videos such as those with dynamic backgrounds, bad weather, and illumination changes. In most approaches, the training and test strategies are not reported, making it difficult to compare algorithm results. The proposed method can be evaluated in two ways, as video-optimized or video-agnostic. In video-optimized approaches, the training and test sets are drawn randomly from the overall dataset and kept separate. The results of the proposed method are compared with competitive methods from the literature using the same evaluation strategy. The introduced MOS networks are observed to give highly competitive results on the CDnet2014 dataset. The source code for the simulations provided in this work is available online.
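
The final thresholding stage mentioned in the abstract can be illustrated with a minimal sketch. The function name `threshold_probability_map`, the 0.5 cut-off, and the toy probability values are assumptions made for illustration only; the abstract does not state which threshold or implementation the authors use.

```python
def threshold_probability_map(prob_map, threshold=0.5):
    """Turn a 2D foreground probability map into a binary mask.

    Pixels with probability >= threshold are marked 1 (moving object),
    all others 0 (background). The 0.5 default is an assumed value.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Toy 3x4 probability map standing in for the network output (made-up values).
prob_map = [
    [0.02, 0.10, 0.85, 0.91],
    [0.05, 0.48, 0.77, 0.95],
    [0.01, 0.03, 0.50, 0.20],
]

mask = threshold_probability_map(prob_map)
for row in mask:
    print(row)
```

A lower threshold would mark more pixels as foreground (favoring recall), while a higher one would mark fewer (favoring precision), so in practice the cut-off would be tuned on validation data.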

MoNet: Deep Motion Exploitation for Video Object Segmentation

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

MATNet: Motion-Attentive Transition Network for Zero-Shot Video Object Segmentation

Motion-Attentive Transition for Zero-Shot Video Object Segmentation

Deep Transport Network for Unsupervised Video Object Segmentation

MoBox: Enhancing Video Object Segmentation with Motion-Augmented Box Supervision

Moving Object Proposals with Deep Learned Optical Flow for Video Object Segmentation

MOA-Net: Multilevel Object Aware Network for Remote Sensing Image Semantic Segmentation

Fast Video Object Segmentation Via Dynamic Targeting Network

3D convolutional long short-term encoder-decoder network for moving object segmentation

Learning Motion-Appearance Co-Attention for Zero-Shot Video Object Segmentation

MUNet: Motion uncertainty-aware semi-supervised video object segmentation

MoNet: Motion-Based Point Cloud Prediction Network

MA-ResNet50: A General Encoder Network for Video Segmentation

Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance

Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation

Prototypical Matching Networks for Video Object Segmentation

Implicit Motion-Compensated Network for Unsupervised Video Object Segmentation

Motion Cues Guided Feature Aggregation and Enhancement for Video Object Segmentation

Motion-Guided Spatial Time Attention for Video Object Segmentation

Deep Object Co-segmentation via Spatial-Semantic Network Modulation