Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection

Qinyao Chang,Shiping Zhu
DOI: https://doi.org/10.1007/s12559-023-10114-x
IF: 4.89
2023-02-24
Cognitive Computation
Abstract:Inspired by the human vision attention mechanism, the human vision system uses multilevel features to extract accurate visual saliency information, so multilevel features are important for saliency detection. On the basis of the numerous biological frameworks for visual information processing, we find that better combination and use of multilevel features with time information can greatly improve the accuracy of the video saliency model. The proposed TSFP-Net has the advantages of much higher prediction precision, simple structure, second smallest size, and the third fastest running time compared to the state-of-the-art methods. The encoder extracts multiscale temporal-spatial features from the input continuous video frames and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder performs hierarchical decoding of temporal-spatial features from different scales and finally produces a saliency map from the integration of multiple video frames. Our model is simple yet effective and can run in real time. We perform abundant experiments, and the results indicate that the well-designed structure can significantly improve the precision of video saliency detection. Experimental results on three purely visual video saliency benchmarks demonstrate that our method outperforms the existing state-of-the-art methods.
computer science, artificial intelligence,neurosciences
What problem does this paper attempt to address?