Transformer-based Cross Reference Network for Video Salient Object Detection

Kan Huang, Chunwei Tian, Jingyong Su, Jerry Chun-Wei Lin
DOI: https://doi.org/10.1016/j.patrec.2022.06.006
IF: 4.757
2022-01-01
Pattern Recognition Letters
Abstract: Video salient object detection is a fundamental computer vision task aimed at highlighting the most conspicuous objects in a video sequence. There are two key challenges in video salient object detection: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine the two into a robust saliency representation. To handle these challenges, we propose a novel Transformer-based Cross Reference Network (TCRN), which fully exploits long-range context dependencies in both feature representation extraction and cross-modal (i.e., appearance and motion) integration. In contrast to existing CNN-based methods, our approach formulates video salient object detection as a sequence-to-sequence prediction task. In the proposed approach, deep feature extraction is achieved by a pure vision transformer with multi-resolution token representations. Specifically, we design a Gated Cross Reference (GCR) module to effectively integrate appearance and motion into a saliency representation. The GCR first propagates global context information between the different modalities, and then performs cross-modal fusion through a gate mechanism. Extensive evaluations on five widely used benchmarks show that the proposed Transformer-based method performs favorably against existing state-of-the-art methods. © 2022 Elsevier B.V. All rights reserved.
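The two-step behaviour the abstract attributes to the GCR module (cross-modal context propagation followed by gated fusion) can be illustrated with a minimal sketch. The abstract does not specify the module's internals, so everything below is an assumption made for illustration: the single-head dot-product attention without learned projections, the residual connections, the sigmoid gate computed from the concatenated tokens, and the names `cross_attend`, `gated_cross_reference`, and `w_gate` are all hypothetical and are not taken from the paper.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attend(queries, context):
    """Each query token gathers global context from all tokens of the
    other modality via scaled dot-product attention (assumed form)."""
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / math.sqrt(d) for c in context])
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(d)])
    return out


def gated_cross_reference(appearance, motion, w_gate):
    """Hypothetical GCR-style fusion of two token sequences of shape N x D.
    w_gate holds D weight vectors of length 2*D (an assumed parameterisation)."""
    # Step 1: propagate global context between the two modalities.
    app_ctx = cross_attend(appearance, motion)
    mot_ctx = cross_attend(motion, appearance)
    fused = []
    for a, ac, m, mc in zip(appearance, app_ctx, motion, mot_ctx):
        a_ref = [x + y for x, y in zip(a, ac)]  # residual update (assumption)
        m_ref = [x + y for x, y in zip(m, mc)]
        # Step 2: a per-channel gate decides how much of each modality to keep.
        joint = a_ref + m_ref  # list concatenation, length 2*D
        gate = [sigmoid(dot(joint, w)) for w in w_gate]
        fused.append([g * x + (1.0 - g) * y
                      for g, x, y in zip(gate, a_ref, m_ref)])
    return fused


# Toy usage with random tokens; sizes are arbitrary.
random.seed(0)
N, D = 6, 4
appearance = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
motion = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
w_gate = [[random.gauss(0, 0.1) for _ in range(2 * D)] for _ in range(D)]
fused = gated_cross_reference(appearance, motion, w_gate)
```

Because the gate lies in (0, 1), each fused channel is a convex combination of the two context-enriched modalities, which is one common way to let a network suppress an unreliable cue (for example, noisy motion) on a per-token basis.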