From Pixels to Semantics: Self-Supervised Video Object Segmentation with Multiperspective Feature Mining

Ruoqi Li, Yifan Wang, Lijun Wang, Huchuan Lu, Xiaopeng Wei, Qiang Zhang
DOI: https://doi.org/10.1109/tip.2022.3201603
IF: 10.6
2022-01-01
IEEE Transactions on Image Processing
Abstract: Existing self-supervised methods pose one-shot video object segmentation (O-VOS) as pixel-level matching to enable segmentation mask propagation across frames. However, the two tasks are not fully equivalent, since O-VOS relies more on semantic correspondence than on accurate pixel matching. To remedy this issue, we explore a new self-supervised framework that integrates pixel-level correspondence learning with semantic-level adaptation. The pixel-level correspondence learning is performed through photometric reconstruction of adjacent RGB frames during offline training, while the semantic-level adaptation operates at test time by enforcing a bi-directional agreement of the predicted segmentation masks. In addition, we propose a new network architecture with a multi-perspective feature mining mechanism, which not only enhances reliable features but also suppresses noisy ones to facilitate more robust image matching. By training the network with the proposed self-supervised framework, we achieve state-of-the-art performance on widely adopted datasets, further narrowing the gap between self-supervised learning methods and their fully supervised counterparts.
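The two training signals named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the network, the loss form, or the temperature, so the sketch assumes a generic softmax affinity between per-pixel features, an L1 photometric reconstruction loss, and a forward-then-backward mask round trip as one possible reading of "bi-directional agreement". All function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affinity(feat_ref, feat_tgt, temperature=0.07):
    """Soft correspondence from each target pixel to all reference pixels.

    feat_ref, feat_tgt: (N, C) per-pixel feature arrays (hypothetical inputs).
    Returns an (N_tgt, N_ref) matrix whose rows sum to 1.
    """
    feat_ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    feat_tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    return softmax(feat_tgt @ feat_ref.T / temperature, axis=1)

def propagate(aff, values_ref):
    # Copy reference values (colours in training, mask scores at test time)
    # to the target frame as an affinity-weighted average.
    return aff @ values_ref

def photometric_loss(aff, rgb_ref, rgb_tgt):
    # Assumed L1 loss between the reconstructed and the true target frame.
    return np.abs(propagate(aff, rgb_ref) - rgb_tgt).mean()

def bidirectional_agreement(aff_fwd, aff_bwd, mask_ref):
    # Assumed form: send the mask reference -> target -> reference and
    # penalise the difference from the original reference mask.
    mask_tgt = propagate(aff_fwd, mask_ref)
    mask_back = propagate(aff_bwd, mask_tgt)
    return np.abs(mask_back - mask_ref).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = np.eye(4)                      # 4 pixels with orthogonal features
    rgb = rng.random((4, 3))               # toy colours in [0, 1)
    mask = np.array([[1.0], [0.0], [1.0], [0.0]])
    aff = affinity(feats, feats)
    print(photometric_loss(aff, rgb, rgb))
    print(bidirectional_agreement(aff, aff, mask))
```

With identical, well-separated features in both frames, the affinity is close to the identity matrix, so both losses are near zero. In the setting the abstract describes, the first loss would drive offline training, while an agreement term of the second kind would be minimised at test time.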