Tokenizing Features for Fast Video Object Segmentation

Tianfang Meng, Wei Zhang, Wenqiang Zhang
DOI: https://doi.org/10.1109/icme52920.2022.9859678
2022-01-01
Abstract: This paper investigates how to take full advantage of the temporal and spatial information in videos with minimal computational cost in the semi-supervised video object segmentation (VOS) task. Current state-of-the-art methods have achieved remarkable performance by matching features of the current frame with those of past frames to propagate the past segmentation masks to the current. However, the inference speeds of such matching-based methods are limited due to the tremendous amount of computation on pixel-to-pixel matching. To address this problem, we propose a fast matching mechanism for VOS that extracts essential object information as a handful of token vectors for matching. By extracting succinct but sufficient information from the pixel-wise features, we develop a fast VOS model which achieves competitive segmentation performance (81.6% $J$ & $F$ on DAVIS-2017), maintaining a high inference speed (FPS = 42.1).
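To make the cost argument in the abstract concrete, the following is a minimal NumPy sketch contrasting dense pixel-to-pixel matching with matching against a handful of token vectors. It is not the authors' implementation: the abstract does not specify how tokens are extracted, so the `tokenize` function (soft pooling with a random projection `proj`), the feature sizes, and the token count `K` are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_matching(query, memory):
    # Dense matching: every current-frame pixel is compared with every
    # past-frame pixel. query: (Nq, C), memory: (Nm, C) -> (Nq, Nm).
    return softmax(query @ memory.T / np.sqrt(query.shape[1]), axis=1)

def tokenize(memory, proj):
    # Hypothetical tokenizer (an assumption, not the paper's method):
    # soft-pool the Nm memory pixels into K token vectors.
    # proj: (C, K) scoring weights -> tokens: (K, C).
    weights = softmax(memory @ proj, axis=0)  # (Nm, K), columns sum to 1
    return weights.T @ memory

def token_matching(query, tokens):
    # Compact matching: every current-frame pixel is compared with K tokens.
    # query: (Nq, C), tokens: (K, C) -> (Nq, K).
    return softmax(query @ tokens.T / np.sqrt(query.shape[1]), axis=1)

rng = np.random.default_rng(0)
C, K = 64, 8            # channel width and token count (assumed values)
Nq = 16 * 28            # pixels in the current frame's feature map
Nm = 4 * Nq             # pixels pooled from four past frames

query = rng.standard_normal((Nq, C))
memory = rng.standard_normal((Nm, C))
proj = rng.standard_normal((C, K))

dense = pixel_matching(query, memory)
tokens = tokenize(memory, proj)
compact = token_matching(query, tokens)

# The affinity matrix shrinks from Nq x Nm to Nq x K.
print("dense affinity:", dense.shape)
print("token affinity:", compact.shape)
print("size ratio:", dense.size / compact.size)
```

Under these assumed sizes, the dense affinity grows with the number of stored past-frame pixels, while the token affinity depends only on `K`, which is the kind of saving the abstract attributes to matching against a few tokens.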