Semi-supervised Spatial Temporal Attention Network for Video Polyp Segmentation

Zhao Xinkai, Wu Zhenhua, Tan Shuangyi, Fan De-Jun, Li Zhen, Wan Xiang, Li Guanbin
DOI: https://doi.org/10.1007/978-3-031-16440-8_44
2022-01-01
Abstract: Deep learning-based polyp segmentation approaches have achieved great success on image datasets. However, annotating polyp videos frame by frame requires a heavy workload, which limits the application of polyp segmentation algorithms to clinical videos. In this paper, we address the semi-supervised video polyp segmentation task, which requires only sparsely annotated frames to train a video polyp segmentation network. We propose a novel spatial-temporal attention network composed of a Temporal Local Context Attention (TLCA) module and a Proximity Frame Time-Space Attention (PFTSA) module. Specifically, the TLCA module refines the prediction for the current frame using the prediction results of nearby frames in the video clip. The PFTSA module uses a simple yet powerful hybrid transformer architecture to capture long-range dependencies in time and space efficiently. Combined with consistency constraints, the network fuses representations of proximity frames at different scales to generate pseudo-masks for unlabeled images. We further propose a pseudo-mask-based training method. Additionally, we re-masked a subset of LDPolypVideo and used it as a semi-supervised polyp segmentation dataset for our experiments. Experimental results show that, with sparse annotation, our proposed semi-supervised approach outperforms existing image-level semi-supervised and fully supervised methods while running at 135 fps. The code is available at github.com/ShinkaiZ/SSTAN.
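The abstract does not specify how pseudo-masks are computed, so the following is only a minimal sketch of the general idea of pseudo-labeling in a sparsely annotated clip, not the authors' implementation. It assumes per-frame foreground probability maps are already available, and it substitutes a plain average over nearby frames for the learned TLCA and PFTSA attention modules. The function name `make_pseudo_masks` and the parameters `window` and `threshold` are hypothetical.

```python
from typing import List, Optional

Frame = List[List[float]]  # H x W map of foreground probabilities (or a 0/1 mask)


def make_pseudo_masks(
    probs: List[Frame],
    labels: List[Optional[Frame]],
    window: int = 1,
    threshold: float = 0.5,
) -> List[Frame]:
    """Return one mask per frame of a clip.

    Annotated frames keep their ground-truth mask. For unannotated frames,
    the probability maps of frames within `window` steps are averaged
    (an assumed stand-in for learned temporal attention) and thresholded.
    """
    n = len(probs)
    masks: List[Frame] = []
    for t in range(n):
        if labels[t] is not None:
            masks.append(labels[t])  # sparse annotation: trust the label
            continue
        # Nearby frames, clipped at the clip boundaries.
        nearby = probs[max(0, t - window): min(n, t + window + 1)]
        height, width = len(probs[t]), len(probs[t][0])
        mask = [
            [
                1.0 if sum(f[i][j] for f in nearby) / len(nearby) >= threshold else 0.0
                for j in range(width)
            ]
            for i in range(height)
        ]
        masks.append(mask)
    return masks


# Toy clip: three 1x2 frames, only the first one annotated.
clip_probs = [[[0.9, 0.1]], [[0.4, 0.2]], [[0.8, 0.3]]]
clip_labels = [[[1.0, 0.0]], None, None]
pseudo = make_pseudo_masks(clip_probs, clip_labels, window=1)
```

In this toy clip, the middle frame's first pixel has probability 0.4 on its own, which falls below the threshold, yet the average over its neighbours marks it as foreground. That is the intuition behind using temporal context to stabilise predictions on unlabeled frames. The resulting masks could then serve as training targets for those frames.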