Context-aware Deformable Alignment for Video Object Segmentation

Jie Yang, Mingfu Xia, Xue Zhou
DOI: https://doi.org/10.1109/icpr56361.2022.9956170
2022-01-01
Abstract: Matching-based semi-supervised video object segmentation (VOS) either resorts to non-local matching to retrieve and aggregate the spatio-temporal features of past frames or relies on template matching to learn a similarity representation. Although these approaches have achieved remarkable progress, the former incurs considerable computation overhead and the latter fails when confronted with large appearance changes. In this paper, we address both issues. First, we propose a Context-aware Deformable Alignment (CDA) mechanism to align the spatio-temporal features of past frames more efficiently. CDA learns where to match in a deformable fashion, performing local context-aware matching instead of non-local pixel-wise matching; this dramatically reduces computational complexity while retaining the ability to model long-range spatio-temporal dependencies. Second, we present a Dynamic Kernel Matching (DKM) technique to tackle mismatches caused by appearance and scale variations. DKM dynamically adapts the template feature to object appearance changes rather than keeping it fixed, which improves robustness for long-term VOS. Our framework, dubbed CDANet, is evaluated on popular benchmarks and achieves competitive performance compared with state-of-the-art methods.
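The efficiency argument can be illustrated with a rough count of similarity computations. The sketch below is not taken from the paper: the function names, the counting model (one similarity per query-memory pixel pair), and the example sizes (a 30x54 feature map, 5 memory frames, 9 sampled locations per pixel) are all assumptions chosen only to show how local matching at a few learned locations scales compared with dense non-local matching.

```python
def nonlocal_matching_cost(height, width, num_memory_frames):
    """Similarity computations for dense non-local matching (assumed model):
    every query pixel is compared with every pixel of every memory frame."""
    query_pixels = height * width
    memory_pixels = num_memory_frames * height * width
    return query_pixels * memory_pixels


def local_deformable_cost(height, width, num_memory_frames, samples_per_pixel):
    """Similarity computations for local matching (assumed model):
    every query pixel is compared with only a fixed number of sampled
    locations in each memory frame."""
    query_pixels = height * width
    return query_pixels * num_memory_frames * samples_per_pixel


if __name__ == "__main__":
    # Hypothetical sizes, not values reported in the paper.
    h, w, t, k = 30, 54, 5, 9
    dense = nonlocal_matching_cost(h, w, t)
    sparse = local_deformable_cost(h, w, t, k)
    print(f"non-local: {dense} comparisons")
    print(f"local deformable: {sparse} comparisons")
    print(f"reduction factor: {dense / sparse:.0f}x")
```

Under this simplified model the reduction factor equals the number of pixels per frame divided by the number of sampled locations, so the gap widens as feature resolution grows. The count ignores the cost of predicting the sampling offsets, which a real implementation would also have to pay.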