Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Morteza Moradi, Simone Palazzo, Concetto Spampinato
2024-01-16
Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of the prior strategies, e.g., 3D convolutional networks and LSTM-based networks, for capturing long-range dependencies has been effectively compensated. While VSP has drawn benefits from spatio-temporal transformers, finding the most effective way for aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features' dimensions in the decoder. This decoder-based architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses how to aggregate and decode temporal features effectively in video saliency prediction (VSP), particularly when capturing long-range temporal dependencies. Specifically:

1. **Limitations of Existing Methods**:
   - 3D convolutional networks and LSTM-based networks struggle to capture long-range temporal dependencies.
   - Spatio-temporal Transformer-based methods improve on this, but handling high-dimensional temporal features in the decoding stage remains a challenge.
2. **Research Objectives**:
   - Propose a Transformer-based video saliency prediction model (THTD-Net) whose decoder uses the complete temporal information provided by the encoder, without reducing the temporal dimension before decoding begins.
   - Reduce the dimension of the temporal features gradually across decoding stages, which avoids a sudden loss of information and keeps every stage informative.
3. **Main Contributions**:
   - A lightweight single-decoder architecture that avoids multi-branch decoders and complex attention mechanisms and reduces the number of model parameters.
   - Experiments on common benchmarks (DHF1K, UCF-sports, and Hollywood-2) showing performance comparable to that of more complex multi-branch models, with higher parameter efficiency.

Through these improvements, THTD-Net aims to process spatio-temporal features in videos more effectively and to improve the accuracy and efficiency of video saliency prediction.
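
The idea of reducing the temporal dimension gradually, rather than collapsing it in one step, can be illustrated with a small shape-bookkeeping sketch. This is not the authors' implementation: the clip length of 16 frames, the per-stage reduction factor of 2, and the function names `gradual_temporal_schedule` and `abrupt_temporal_schedule` are all hypothetical choices made for illustration, since the text above does not state the actual values used in THTD-Net.

```python
from typing import List


def gradual_temporal_schedule(frames: int, factor: int = 2) -> List[int]:
    """Return the temporal length after each hypothetical decoder stage.

    `frames` and `factor` are illustrative assumptions, not values from the paper.
    """
    if frames < 1:
        raise ValueError("frames must be at least 1")
    if factor < 2:
        raise ValueError("factor must be at least 2")
    schedule = [frames]
    while schedule[-1] > 1:
        # Ceiling division, so lengths that are not powers of `factor` still reach 1.
        schedule.append(-(-schedule[-1] // factor))
    return schedule


def abrupt_temporal_schedule(frames: int) -> List[int]:
    """Baseline for contrast: collapse every frame in a single step."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    return [frames, 1] if frames > 1 else [frames]


if __name__ == "__main__":
    print(gradual_temporal_schedule(16))  # [16, 8, 4, 2, 1]
    print(abrupt_temporal_schedule(16))   # [16, 1]
```

Under these assumptions, the gradual schedule passes through several intermediate temporal lengths before reaching the single-frame saliency map, so each stage still sees multi-frame features. The abrupt schedule discards all temporal structure in one step.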