COMatchNet: Co-Attention Matching Network for Video Object Segmentation.

Lufei Huang, Fengming Sun, Xia Yuan
DOI: https://doi.org/10.1007/978-3-031-02375-0_20
2021-01-01
Abstract: Semi-supervised video object segmentation (semi-VOS) predicts pixel-accurate masks of the target objects in every frame of a video, given the ground-truth mask of the first frame. A critical challenge in this task is modeling the dependency between the query frame and the other frames. Most methods neglect this inherent relevance or do not make full use of it. In this paper, we propose a novel network called Co-Attention Matching Network (COMatchNet) for semi-VOS. COMatchNet consists mainly of a co-attention module and a matching module. The co-attention module extracts the frame correlation among the query frame, the previous frame, and the first frame. The matching module computes pixel-level matching scores and finds the regions in the query frame that are most similar to the preceding frames. COMatchNet integrates these two levels of information to generate fine-grained segmentation masks. We conduct extensive experiments on three popular video object segmentation benchmarks: DAVIS 2016, DAVIS 2017, and YouTube-VOS. COMatchNet achieves competitive performance (J&F) of 86.8%, 75.9%, and 81.4% on these benchmarks, respectively.
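As a rough illustration of what pixel-level matching can look like, the sketch below scores each query-frame pixel by its best cosine similarity to the object pixels of a reference frame. It is not the authors' implementation: the abstract does not specify the similarity measure or the feature extractor, so the function name `pixel_matching_scores`, the cosine similarity, and the NumPy feature arrays are all assumptions made for illustration.

```python
import numpy as np

def pixel_matching_scores(query_feat, ref_feat, ref_mask):
    """Hypothetical helper: best cosine similarity of each query pixel
    to any reference pixel that lies inside the object mask.

    query_feat, ref_feat: (H, W, C) feature maps (assumed precomputed).
    ref_mask: (H, W) boolean mask of the target object in the reference frame.
    Returns an (H, W) array of matching scores.
    """
    h, w, c = query_feat.shape
    q = query_feat.reshape(-1, c)
    # Keep only reference features that belong to the object.
    r = ref_feat.reshape(-1, c)[ref_mask.reshape(-1)]
    if r.shape[0] == 0:
        # No object pixels in the reference frame: nothing to match against.
        return np.zeros((h, w))
    # L2-normalise so that the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    r = r / (np.linalg.norm(r, axis=1, keepdims=True) + 1e-8)
    sim = q @ r.T  # shape: (H*W, number of object pixels)
    # For every query pixel, keep its most similar object pixel.
    return sim.max(axis=1).reshape(h, w)
```

In such a scheme, high scores mark query regions that resemble the object in the first or previous frame; a segmentation head would then combine these scores with other cues, such as the frame correlation from a co-attention module.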