Video Object Segmentation by Multi-Scale Attention Using Bidirectional Strategy

Jingxin Wang, Yunfeng Zhang, Fangxun Bao, Yuetong Liu, Qiuyue Zhang, Caiming Zhang
DOI: https://doi.org/10.1016/j.imavis.2024.105136
IF: 3.86
2024-01-01
Image and Vision Computing
Abstract: This paper focuses on semi-supervised video object segmentation (VOS). Recently, several Space-Time Memory based networks have effectively improved the performance of VOS. However, most methods predict the target object mask in the forward direction only, so errors propagate and mislead the segmentation of future frames. Moreover, the rich multi-scale information of objects in videos needs to be exploited effectively to extract fine-grained spatial features. To address these limitations, we present a network with a multi-scale attention module for semi-supervised VOS, combined with a new bidirectional strategy during training. First, we propose the bidirectional strategy, in which a backward flow is combined with the existing standard forward flow. With this strategy, we can rely on the first frame's ground-truth mask to mitigate error propagation. Second, a multi-scale attention module is designed to extract multi-scale features with different weights and to exchange information across multi-scale channel attention. In particular, the multi-scale attention module helps the network extract fine-grained masks during bidirectional training. Experimental results show that our network achieves significant segmentation performance compared to state-of-the-art approaches on the YouTube-VOS and DAVIS datasets.
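The bidirectional strategy is described only at a high level in the abstract, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a hypothetical per-frame predictor `step(prev_mask, frame)`, flattened soft masks, an L1 loss, and an unweighted sum of the forward and backward terms; none of these names or choices come from the paper. The forward flow starts from the first frame's ground-truth mask, and the backward flow walks back from the last prediction so that its output for the first frame can be checked against that same ground truth.

```python
from typing import Callable, List, Sequence

Mask = List[float]   # flattened soft mask, values in [0, 1] (assumption)
Frame = List[float]  # flattened toy frame (assumption)
Step = Callable[[Mask, Frame], Mask]  # hypothetical per-frame predictor


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two flattened masks."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def propagate(step: Step, start_mask: Mask, frames: Sequence[Frame]) -> List[Mask]:
    """Run `step` along `frames`, feeding each prediction into the next call."""
    masks: List[Mask] = []
    prev = start_mask
    for frame in frames:
        prev = step(prev, frame)
        masks.append(prev)
    return masks


def bidirectional_loss(step: Step, frames: Sequence[Frame],
                       gt_masks: Sequence[Mask]) -> float:
    """Forward loss on frames 1..T-1 plus a backward loss on frame 0.

    Requires at least two frames. The equal weighting of the two terms
    is an assumption made for this sketch.
    """
    # Forward flow: start from the ground-truth mask of frame 0.
    fwd = propagate(step, gt_masks[0], frames[1:])
    forward_loss = sum(l1_loss(p, g) for p, g in zip(fwd, gt_masks[1:])) / len(fwd)

    # Backward flow: start from the last forward prediction and
    # revisit frames T-2, ..., 0 in reverse order.
    bwd = propagate(step, fwd[-1], frames[-2::-1])

    # The backward prediction for frame 0 is compared with its ground truth,
    # so drift accumulated in either direction is penalized.
    backward_loss = l1_loss(bwd[-1], gt_masks[0])
    return forward_loss + backward_loss
```

With a predictor that copies the previous mask unchanged and identical ground-truth masks, both terms are zero; a predictor that lets the mask decay at every step is penalized in both directions, which is the error-propagation effect the strategy aims to mitigate.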