Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Morteza Moradi, Simone Palazzo, Concetto Spampinato

2024-01-16

Abstract: In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been a hot research topic in video saliency prediction (VSP). With the emergence of spatio-temporal transformers, the weakness of the prior strategies, e.g., 3D convolutional networks and LSTM-based networks, for capturing long-range dependencies has been effectively compensated. While VSP has drawn benefits from spatio-temporal transformers, finding the most effective way for aggregating temporal features is still challenging. To address this concern, we propose a transformer-based video saliency prediction approach with high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between features that are extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing temporal features' dimensions in the decoder. This decoder-based architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

This paper addresses how to aggregate and decode temporal features effectively in video saliency prediction (VSP), with a focus on capturing long-range temporal dependencies. Specifically:

1. **Limitations of Existing Methods**:
   - 3D convolutional networks and LSTM-based networks are weak at capturing long-range temporal dependencies.
   - Spatio-temporal Transformer-based methods have improved on this, but handling high-dimensional temporal features in the decoding stage remains a challenge.
2. **Research Objectives**:
   - Propose a Transformer-based video saliency prediction model (THTD-Net) whose decoder receives the full temporal information produced by the encoder, rather than a representation whose temporal dimension has already been collapsed.
   - Reduce the temporal dimension gradually across the decoding stages, so that information is not lost abruptly and every stage still works with rich features.
3. **Main Contributions**:
   - Design a lightweight single-decoder architecture that avoids multi-branch decoders and complex attention mechanisms and reduces the number of model parameters.
   - Report results on common benchmark datasets (DHF1K, UCF-sports and Hollywood-2) showing performance comparable to that of more complex state-of-the-art methods, with better parameter efficiency.

Through these design choices, THTD-Net aims to process spatio-temporal features in videos more effectively and to improve both the accuracy and the efficiency of video saliency prediction.
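The idea of reducing the temporal dimension gradually, rather than in one step, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the clip length of 32 frames, the reduction factor of 2, and the use of plain averaging (a stand-in for whatever learned layers the model actually uses) are all assumptions made only for illustration.

```python
# Illustrative sketch only; all names and numbers here are hypothetical.

def temporal_schedule(t_in: int, factor: int = 2) -> list[int]:
    """Temporal length after each decoder stage, shrinking by `factor`
    per stage until a single frame remains."""
    if t_in < 1 or factor < 2:
        raise ValueError("t_in must be >= 1 and factor must be >= 2")
    sizes = [t_in]
    while sizes[-1] > 1:
        sizes.append(max(1, sizes[-1] // factor))
    return sizes


def pool_time(frames: list[list[float]], factor: int = 2) -> list[list[float]]:
    """Average consecutive groups of `factor` frames.
    Each frame is a flat feature vector. Averaging is an assumed
    stand-in for a learned temporal reduction."""
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        pooled.append([sum(vals) / len(group) for vals in zip(*group)])
    if not pooled:
        # Fewer frames than `factor`: merge everything that is left.
        pooled.append([sum(vals) / len(frames) for vals in zip(*frames)])
    return pooled


def decode(frames: list[list[float]], factor: int = 2) -> list[float]:
    """Apply `pool_time` stage by stage until one frame is left."""
    while len(frames) > 1:
        frames = pool_time(frames, factor)
    return frames[0]


if __name__ == "__main__":
    print(temporal_schedule(32))      # gradual: several small steps
    print(temporal_schedule(32, 32))  # abrupt: one large step
    print(decode([[0.0], [2.0], [4.0], [6.0]]))
```

With a factor of 2, a 32-frame input passes through five stages, each of which still sees several time steps; with a factor of 32, all temporal information is merged in a single step, which is the abrupt loss that a gradual schedule is meant to avoid.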

Video Saliency Forecasting Transformer

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Transformer-based Multi-scale Feature Integration Network for Video Saliency Prediction

Hybrid Attention Spatial-Temporal Network for Video Saliency Prediction

SalFoM: Dynamic Saliency Prediction with Video Foundation Models

TDViT: Temporal Dilated Video Transformer for Dense Video Tasks

Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer

Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer

STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model

TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders using Hierarchical Maps Distillation

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

VST++: Efficient and Stronger Visual Saliency Transformer

CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer

UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection

A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction