Hybrid Attention Spatial-Temporal Network for Video Saliency Prediction

Qi-Yun Dong, Geng-Sheng Chen, Xiao-Fang Zhou
DOI: https://doi.org/10.1109/icsict55466.2022.9963218
2022-01-01
Abstract: Video saliency prediction (VSP) aims to understand and model human visual attention in dynamic scenes. Current methods tend to generate visual representations over a fixed local space-time neighborhood, neglecting the inherent long-range spatiotemporal relations. To remedy this deficiency, we propose a new Hybrid Attention Spatial-Temporal Network (HAST-Net) for VSP. First, we design a novel Local Perception Module (LPM) that combines a dynamic position encoding block with a channel attention block to extract position relations and channel information more effectively. Second, we use a Spatial-Temporal Multi-head Self-attention (STMSA) module, followed by a Convolutional Residual Feed Forward Network (CRFFN) module, to capture long-range spatiotemporal dependencies. Third, we add the above modules to a 3D convolutional backbone, placing them in sequence to integrate channel attention with self-attention. Experimental results show that our model outperforms state-of-the-art models, improving NSS and CC on the DHF1K dataset by 2.8% and 2.7%, respectively.
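The two metrics named in the abstract, NSS (Normalized Scanpath Saliency) and CC (linear correlation coefficient), can be illustrated with a minimal pure-Python sketch of their commonly used definitions. This is not the authors' evaluation code: the function names, the nested-list map representation, and the use of the population standard deviation are assumptions made for illustration only.

```python
import math


def _flatten(grid):
    """Flatten a 2D list (rows of pixel values) into a list of floats."""
    return [float(v) for row in grid for v in row]


def _mean_std(values):
    """Return mean and population standard deviation (an assumed convention)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def nss(saliency, fixations):
    """NSS: average z-scored predicted saliency at fixated pixels.

    `saliency` is a predicted map, `fixations` is a binary map of the same
    shape where values > 0 mark human fixation locations.
    """
    flat = _flatten(saliency)
    mean, std = _mean_std(flat)
    if std == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    hits = [(s - mean) / std
            for s, f in zip(flat, _flatten(fixations)) if f > 0]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)


def cc(pred, target):
    """CC: Pearson correlation between predicted and ground-truth maps."""
    p, t = _flatten(pred), _flatten(target)
    p_mean, p_std = _mean_std(p)
    t_mean, t_std = _mean_std(t)
    if p_std == 0 or t_std == 0:
        raise ValueError("a constant map makes CC undefined")
    cov = sum((a - p_mean) * (b - t_mean) for a, b in zip(p, t)) / len(p)
    return cov / (p_std * t_std)
```

For video benchmarks such scores are typically computed per frame and then averaged; how the reported 2.8% and 2.7% gains were aggregated is not stated in the abstract.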