CSANet for Video Semantic Segmentation with Inter-Frame Mutual Learning

Yichen Yuan, Lijun Wang, Yifan Wang
DOI: https://doi.org/10.1109/lsp.2021.3103666
2021-01-01
IEEE Signal Processing Letters
Abstract: Video semantic segmentation aims at generating temporally consistent segmentation results and remains a very challenging task in the deep learning era. In this work, we improve on prior approaches in two respects. At the network architecture level, we present the cross and self-attention network (CSANet). As opposed to prior methods, CSANet not only propagates temporal features from adjacent frames, but is also designed to aggregate spatial context within the current frame, which is shown to effectively improve the consistency and robustness of the extracted deep features. At the loss function level, we further propose the inter-frame mutual learning strategy, which ensures that the cross-attention module focuses on semantically correlated context regions, allowing the segmentation results at different frames to be collaboratively improved. By combining these two novel designs, we show that our proposed method delivers state-of-the-art performance on the Cityscapes and CamVid benchmarks.
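
The abstract describes two kinds of attention: cross-attention that propagates features from an adjacent frame, and self-attention that aggregates spatial context within the current frame. The sketch below illustrates that general idea with plain scaled dot-product attention in NumPy. It is not the authors' implementation: the function names, the absence of learned projections, the feature shapes, and the residual-sum fusion are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention over flattened spatial positions.

    query: (Nq, C), key and value: (Nk, C).
    Returns the attended features (Nq, C) and the attention weights (Nq, Nk).
    """
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key.T / scale, axis=-1)
    return weights @ value, weights


def cross_and_self_attention(current, adjacent):
    """Hypothetical fusion of self- and cross-attention (an assumption).

    current:  (N, C) features of the current frame, N = H * W positions.
    adjacent: (N, C) features of an adjacent frame.
    """
    # Self-attention: each position gathers context from its own frame.
    self_ctx, _ = attention(current, current, current)
    # Cross-attention: current-frame queries read adjacent-frame features.
    cross_ctx, cross_weights = attention(current, adjacent, adjacent)
    # Assumed fusion: residual sum of both context terms.
    fused = current + self_ctx + cross_ctx
    return fused, cross_weights


# Toy example with random features standing in for backbone outputs.
rng = np.random.default_rng(0)
height, width, channels = 4, 4, 8
current = rng.standard_normal((height * width, channels))
adjacent = rng.standard_normal((height * width, channels))
fused, cross_weights = cross_and_self_attention(current, adjacent)
```

Each row of `cross_weights` shows how strongly one position in the current frame attends to every position in the adjacent frame. According to the abstract, the inter-frame mutual learning strategy is what steers this attention toward semantically correlated regions; the loss itself is not specified in the abstract and is therefore not sketched here.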