Abstract: Background subtraction is a fundamental task in computer vision with numerous real-world applications, ranging from object tracking to video surveillance. Dynamic backgrounds pose a significant challenge for this task. Supervised deep learning-based techniques are currently considered state-of-the-art. However, these methods require pixel-wise ground-truth labels, which are time-consuming and expensive to produce. In this work, we propose a weakly supervised framework that can perform background subtraction without requiring per-pixel ground-truth labels. Our framework is trained on a moving object-free sequence of images and comprises two networks. The first network is an autoencoder that generates background images and prepares dynamic background images for training the second network. The dynamic background images are obtained by thresholding the background-subtracted images. The second network is a U-Net that uses the same object-free video for training and the dynamic background images as pixel-wise ground-truth labels. During the test phase, the input images are processed by the autoencoder and U-Net, which generate background and dynamic background images, respectively. The dynamic background image helps remove dynamic motion from the background-subtracted image, enabling us to obtain a foreground image that is free of dynamic artifacts. To demonstrate the effectiveness of our method, we conducted experiments on selected categories of the CDnet 2014 dataset and the I2R dataset. Our method outperformed all top-ranked unsupervised methods. We also achieved better results than one of the two existing weakly supervised methods, and our performance was similar to the other. Our proposed method is online, real-time, efficient, and requires minimal frame-level annotation, making it suitable for a wide range of real-world applications.
What problem does this paper attempt to address?
The paper attempts to address the problem of real-time background subtraction in scenes with dynamic backgrounds. Background subtraction is a fundamental task in computer vision, widely used in object tracking, video surveillance, and other fields. However, dynamic backgrounds (such as fountains, swaying trees, and rippling water surfaces) pose a significant challenge, because their motion may be mistakenly detected as foreground objects, which degrades the performance of the algorithm.
Traditional background subtraction methods, such as those based on statistical approaches and dynamic feedback mechanisms, have limitations when dealing with dynamic backgrounds. Deep learning-based methods have made significant progress on background subtraction in recent years, but they usually require pixel-level annotated data for supervised training, which is both time-consuming and costly to obtain.
To address these issues, the paper proposes a weakly supervised framework that can perform background subtraction without relying on pixel-level ground-truth labels. Specifically, the framework uses two neural networks: an autoencoder that generates static background images, and a U-Net that predicts dynamic background images. With this approach, the paper aims to provide an efficient, real-time, and cost-effective solution that is particularly suited to background subtraction in scenes with dynamic backgrounds.
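The post-processing described in the abstract (threshold the background-subtracted image, then use the dynamic background image to remove dynamic motion) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two networks are replaced by precomputed arrays, and the function names `change_mask` and `remove_dynamic_background`, the threshold value of 0.1, and the choice to combine the masks with a logical AND-NOT are all assumptions made for illustration.

```python
import numpy as np


def change_mask(frame, background, threshold=0.1):
    """Threshold the absolute difference between a frame and a background estimate.

    `background` stands in for the autoencoder output; `threshold` is an
    assumed value, not one taken from the paper.
    """
    diff = np.abs(frame.astype(np.float32) - background.astype(np.float32))
    if diff.ndim == 3:
        # For colour images, keep the largest per-channel difference.
        diff = diff.max(axis=-1)
    return diff > threshold


def remove_dynamic_background(change, dynamic):
    """Keep changed pixels that are not flagged as dynamic background.

    `dynamic` stands in for the thresholded U-Net output. Combining the two
    masks this way is an assumption about how the removal could be done.
    """
    return change & ~dynamic


# Toy 4x4 grayscale scene with intensities in [0, 1].
background = np.zeros((4, 4), dtype=np.float32)
frame = background.copy()
frame[0, 0] = 0.5  # background motion, e.g. a water ripple
frame[2, 2] = 0.9  # a genuine moving object

dynamic = np.zeros((4, 4), dtype=bool)
dynamic[0, 0] = True  # pretend the second network flagged the ripple

change = change_mask(frame, background)
foreground = remove_dynamic_background(change, dynamic)
```

In this toy scene, plain background subtraction marks both the ripple and the object as changed, while the final foreground mask keeps only the object pixel at position (2, 2).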