MoBox: Enhancing Video Object Segmentation with Motion-Augmented Box Supervision

Xiaomin Li, Qinghe Wang, Dezhuang Li, Mengmeng Ge, Xu Jia, You He, Huchuan Lu
DOI: https://doi.org/10.1109/tcsvt.2024.3451981
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: We propose MoBox, a low-cost solution for semi-supervised video object segmentation that requires only bounding boxes as manual annotations for training. Building on a mature semi-supervised video object segmentation network, we redesign the training losses and employ a more stringent training strategy. Specifically, we introduce a constraint term that enhances traditional spatial projection by leveraging the projections of both the ground-truth box and the predicted mask across the two axes simultaneously, rather than evaluating discrepancies along the x-axis and y-axis independently. To harness the intrinsic properties of videos, and in particular the correspondence between the motion represented by optical flow and the original image, we incorporate motion coherence into the color consistency loss as supplementary information and propose a motion discrepancy loss to obtain accurate boundaries. Additionally, to mitigate the ambiguity of weak supervision, we introduce a pseudo strict constraint during training, which significantly improves model performance. Our approach yields competitive scores on popular benchmarks, achieving a J&F score of 78.6 on the DAVIS 2017 validation set and an Overall score of 78.0 on the YouTube-VOS 2018 validation set. These results demonstrate that a semi-supervised video object segmentation model can be trained effectively using only motion-augmented box supervision and the intrinsic information of videos.
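For orientation, the "traditional spatial projection" that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the per-axis projection term commonly used in box-supervised segmentation, not MoBox's joint constraint, whose exact form the abstract does not give. The function names, the inclusive box convention `(x0, y0, x1, y1)`, and the Dice formulation are assumptions made for this sketch.

```python
def project(mask):
    """Max-project a 2D mask (list of rows) onto the x-axis and the y-axis."""
    proj_x = [max(col) for col in zip(*mask)]  # one value per column
    proj_y = [max(row) for row in mask]        # one value per row
    return proj_x, proj_y


def box_to_mask(box, height, width):
    """Rasterize a box given as inclusive pixel coordinates (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0 for x in range(width)]
        for y in range(height)
    ]


def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two 1D profiles; 0 when they match exactly."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def projection_loss(pred_mask, box):
    """Per-axis projection loss: x and y discrepancies are scored independently."""
    height, width = len(pred_mask), len(pred_mask[0])
    pred_x, pred_y = project(pred_mask)
    box_x, box_y = project(box_to_mask(box, height, width))
    return dice_loss(pred_x, box_x) + dice_loss(pred_y, box_y)


# A diagonal mask inside the box (1, 1, 2, 2) of a 4x4 frame.
diagonal = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
empty = [[0.0] * 4 for _ in range(4)]

loss_diagonal = projection_loss(diagonal, (1, 1, 2, 2))
loss_empty = projection_loss(empty, (1, 1, 2, 2))
```

In this toy case the diagonal mask covers only half of the box yet reproduces both box projections, so the per-axis term cannot distinguish it from a full mask. This is one simple instance of the weak-supervision ambiguity that the abstract's additional constraints are meant to reduce.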