One-shot Training for Video Object Segmentation

Baiyu Chen, Sixian Chan, Xiaoqin Zhang
2024-05-23
Abstract: Video Object Segmentation (VOS) aims to track objects across frames in a video and segment them based on the initial annotated frame of the target objects. Previous VOS works typically rely on fully annotated videos for training. However, acquiring fully annotated training videos for VOS is labor-intensive and time-consuming. Meanwhile, self-supervised VOS methods have attempted to build VOS systems through correspondence learning and label propagation. Still, the absence of mask priors harms their robustness to complex scenarios, and the label propagation paradigm makes them impractical in terms of efficiency. To address these issues, we propose, for the first time, a general one-shot training framework for VOS, requiring only a single labeled frame per training video and applicable to a majority of state-of-the-art VOS networks. Specifically, our algorithm consists of: i) Inferring object masks time-forward based on the initial labeled frame. ii) Reconstructing the initial object mask time-backward using the masks from step i). Through this bi-directional training, a satisfactory VOS network can be obtained. Notably, our approach is extremely simple and can be employed end-to-end. Finally, our approach uses a single labeled frame of YouTube-VOS and DAVIS datasets to achieve comparable results to those trained on fully labeled datasets. The code will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses Video Object Segmentation (VOS), the task of tracking and segmenting objects across the frames of a video. Existing VOS methods usually require fully annotated videos for training, which is both time-consuming and expensive. To reduce this cost, the paper proposes an end-to-end one-shot training framework that requires only one annotated frame per training video and can be applied to most state-of-the-art VOS networks.

The authors observed that VOS networks can predict rough masks across a video sequence even when starting from noisy reference masks (such as empty or all-black masks). Based on this finding, they designed a feedback loop: the network first infers object masks forward in time from the labeled initial frame, and then reconstructs the initial mask backward in time from those predicted masks. Because the reconstructed mask can be compared against the single ground-truth label, this bi-directional training lets the network learn without a large amount of annotated data.

The main contributions of the paper are:
1. The first one-shot training framework applicable to most state-of-the-art VOS networks, requiring only one labeled frame per training video.
2. A simple, end-to-end training method that demonstrates strong label efficiency and generalization ability.
3. Results comparable to training on fully annotated data, obtained with a single labeled frame per video on the YouTube-VOS and DAVIS datasets.

Self-supervised VOS methods lack mask priors, which weakens their robustness in complex scenes, and their label propagation paradigm is inefficient; the proposed method avoids both problems by training standard VOS networks with one labeled mask per video. Semi-supervised methods, by comparison, may require two labeled frames and multiple training stages, while the approach in this paper requires only one labeled frame and is trained end-to-end. Experimental results show that VOS networks trained with this method achieve performance comparable to models trained on fully annotated data on benchmarks such as DAVIS and YouTube-VOS, demonstrating its effectiveness in reducing annotation costs.
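The forward-inference and backward-reconstruction loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `one_shot_cycle_loss`, `propagate`, `shift_tracker`, and `iou_loss` are hypothetical, masks are modeled as sets of 1-D pixel positions, each "frame" is reduced to a single integer offset of the object, and propagation from the immediately preceding frame is an assumption made for simplicity. It only shows how one labeled mask can supervise a whole clip.

```python
def iou_loss(pred, target):
    """1 - IoU between two masks represented as sets of pixel positions."""
    union = pred | target
    if not union:
        return 0.0
    return 1.0 - len(pred & target) / len(union)


def one_shot_cycle_loss(propagate, frames, first_mask):
    """Toy bi-directional pass using only the mask of frames[0].

    propagate(ref_frame, ref_mask, frame) stands in for a VOS network
    (hypothetical interface): it predicts the mask of `frame` from a
    reference frame and its mask.
    """
    # Step i): infer masks forward in time from the labeled first frame.
    ref_frame, ref_mask = frames[0], first_mask
    forward_masks = []
    for frame in frames[1:]:
        ref_mask = propagate(ref_frame, ref_mask, frame)
        ref_frame = frame
        forward_masks.append(ref_mask)

    # Step ii): reconstruct the first mask backward in time, starting
    # from the last predicted mask.
    for frame in reversed(frames[:-1]):
        ref_mask = propagate(ref_frame, ref_mask, frame)
        ref_frame = frame

    # The only supervision is the single labeled mask of frames[0].
    return iou_loss(ref_mask, first_mask), forward_masks


def shift_tracker(ref_frame, ref_mask, frame):
    """Toy 'network': moves the mask by the change in object offset."""
    return {p + (frame - ref_frame) for p in ref_mask}


frames = [0, 1, 3]    # toy frames: each is just the object's offset
first_mask = {2, 3}   # labeled mask of the first frame
loss, forward_masks = one_shot_cycle_loss(shift_tracker, frames, first_mask)
```

In an actual training setup, `propagate` would presumably be a differentiable VOS network and the reconstruction loss would be backpropagated through both passes; the set-based loss here only stands in for that.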