Learning Channel-Wise Spatio-Temporal Representations for Video Salient Object Detection

Kan Huang, Ge Li, Shan Liu
DOI: https://doi.org/10.1016/j.neucom.2020.04.015
IF: 6
2020-01-01
Neurocomputing
Abstract: Video salient object detection aims at extracting the most attention-grabbing objects in videos, which can benefit many vision-based tasks such as video understanding. In this work, we explore this problem from a novel perspective, i.e., learning the spatio-temporal representations associated with salient regions in separate feature channels. We propose a Channel-wise Spatio-Temporal Representation learning block (CSTR), which is trained to discriminate between salient and non-salient spatio-temporal patterns in separate channels. A complete CNN architecture for video salient object detection is built on this block. The architecture combines dynamic saliency learned by CSTR with static saliency learned by a Multi-scale Dilated Convolution block (MDC) to derive the final saliency detection results. This intuitive combination improves feature representation capability, which contributes to more precise detection results. Compared with previous works that rely on optical flow or RNNs (LSTM, GRU, etc.) to exploit temporal cues, the proposed method is simple to implement and offers an intuitive way to understand how spatio-temporal patterns correlate with salient regions. Extensive experimental evaluations support the insight behind the proposed method and confirm that the proposed model outperforms other outstanding methods on four popular benchmarks.
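The abstract gives no layer-level detail for CSTR, so the following is only a minimal sketch of one plausible reading of "spatio-temporal representations in separate feature channels": each channel of a short clip is filtered by its own small 3D kernel, with no mixing across channels. The function name `channelwise_st_filter`, the nested-list tensor layout `clip[channel][frame][row][col]`, and the hand-picked temporal-difference kernel are all assumptions made for illustration; the paper's block is learned and may differ substantially.

```python
def channelwise_st_filter(clip, kernels):
    """Filter each channel with its own 3D (time, height, width) kernel.

    clip[c][t][y][x] holds feature values; kernels[c][dt][dy][dx] holds
    one kernel per channel. No padding is applied ("valid" output), and
    channels are never mixed, which is the channel-wise assumption here.
    """
    out = []
    for chan, ker in zip(clip, kernels):
        n_t, n_h, n_w = len(chan), len(chan[0]), len(chan[0][0])
        k_t, k_h, k_w = len(ker), len(ker[0]), len(ker[0][0])
        resp = []
        for t in range(n_t - k_t + 1):
            frame = []
            for y in range(n_h - k_h + 1):
                row = []
                for x in range(n_w - k_w + 1):
                    s = 0.0
                    # Accumulate over the kernel's temporal and spatial extent.
                    for dt in range(k_t):
                        for dy in range(k_h):
                            for dx in range(k_w):
                                s += ker[dt][dy][dx] * chan[t + dt][y + dy][x + dx]
                    row.append(s)
                frame.append(row)
            resp.append(frame)
        out.append(resp)
    return out


# Toy clip: 2 channels, 2 frames, 2x2 pixels.
clip = [
    [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],  # channel 0: identical frames (static)
    [[[0, 0], [0, 0]], [[0, 5], [0, 0]]],  # channel 1: a change at row 0, col 1
]
# Hypothetical fixed kernel: frame-to-frame difference (2 x 1 x 1).
temporal_diff = [[[-1.0]], [[1.0]]]
out = channelwise_st_filter(clip, [temporal_diff, temporal_diff])
```

In this toy case the static channel produces an all-zero response while the changing channel responds only where the change occurs, which illustrates how a per-channel filter can separate dynamic from static patterns. A trained block would learn its kernels rather than use a fixed difference filter.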