Abstract: In virtual reality (VR), accurate and precise estimation of users' visual fixations and head movements can enhance the quality of experience by allocating more computational resources to analysing and rendering the areas of interest. However, research on users' visual exploration behavior in modeling VR visual attention remains insufficient. To bridge the gap between saliency prediction for traditional 2D content and for omnidirectional content, we construct a visual attention dataset and propose a visual saliency prediction framework for panoramic videos. Focusing on instantaneous viewing behavior, we propose a traditional method that adapts 2D saliency models and design a CNN-based model to better predict visual saliency. In the proposed traditional model, the mechanisms of visual attention and viewing behaviors are considered when computing the edge weights of graphs, which are interpreted as Markov chains. The fraction of visual attention diverted to each high-clarity vision (HCV) area is estimated from the equilibrium distribution of this chain. We also propose a graph-based CNN model. The RGB channels and optical flow form the spatial-temporal units of HCVs, from which node feature vectors are extracted. Graph convolution is used to learn the mutual information between the node feature vectors of HCVs while retaining geometric information. The feature vectors are then aligned according to the geometric structure of the equirectangular format, and a feature decoder maps the aligned feature maps to the data distribution. We also construct the dynamic omnidirectional monocular (DOM) saliency dataset, with 64 diverse videos evaluated by 28 people. The subjective results show that instantaneous viewing behavior is important in the VR experience. Extensive experiments on the dataset demonstrate the effectiveness of the proposed framework.
The dataset will be released to facilitate future studies on visual saliency prediction for 360-degree content.
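
The Markov-chain step described in the abstract can be illustrated with a minimal sketch. It assumes a small, hand-picked matrix of non-negative edge weights between HCV nodes; the weights, the function name `equilibrium_distribution`, and the use of power iteration are illustrative assumptions, not the authors' edge-weight formulation or solver. The sketch row-normalizes the weights into transition probabilities and iterates until the distribution over nodes stops changing.

```python
def equilibrium_distribution(weights, tol=1e-12, max_iter=10000):
    """Stationary distribution of the Markov chain induced by edge weights.

    `weights[i][j]` is a hypothetical non-negative weight on the edge
    from node i to node j (placeholder values, not from the paper).
    """
    n = len(weights)
    if n == 0 or any(len(row) != n for row in weights):
        raise ValueError("weights must be a non-empty square matrix")

    # Row-normalize the weights so each row is a probability distribution.
    transition = []
    for row in weights:
        if any(w < 0 for w in row):
            raise ValueError("edge weights must be non-negative")
        total = sum(row)
        if total <= 0:
            raise ValueError("every node needs at least one outgoing edge")
        transition.append([w / total for w in row])

    # Power iteration: start from a uniform distribution and apply the
    # transition matrix until the L1 change falls below `tol`.
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Illustrative symmetric weights between three HCV nodes (made-up values).
example_weights = [
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 3.0],
    [1.0, 3.0, 1.0],
]
attention_share = equilibrium_distribution(example_weights)
```

For symmetric weights such as these, the equilibrium share of each node is proportional to the sum of its edge weights, so the second node, which has the heaviest connections, receives the largest fraction of attention.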

Predicting 360° Video Saliency: A ConvLSTM Encoder-Decoder Network with Spatio-temporal Consistency

Learning Stereoscopic Visual Attention Model for 3D Video

Saliency Prediction Network for 360° Videos

A Spherical Convolution Approach for Learning Long Term Viewport Prediction in 360 Immersive Video

Viewing Behavior Supported Visual Saliency Predictor for 360 Degree Videos

SVGC-AVA: 360-Degree Video Saliency Prediction with Spherical Vector-Based Graph Convolution and Audio-Visual Attention

360Spred: Saliency Prediction for 360-Degree Videos Based on 3D Separable Graph Convolutional Networks

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Dilated Convolutional Neural Networks for Panoramic Image Saliency Prediction

MRGAN360: Multi-stage Recurrent Generative Adversarial Network for 360 Degree Image Saliency Prediction

Hybrid Attention Spatial-Temporal Network for Video Saliency Prediction

360° Image Saliency Prediction by Embedding Self-Supervised Proxy Task

Optimizing Fixation Prediction Using Recurrent Neural Networks for 360° Video Streaming in Head-Mounted Virtual Reality

Video Saliency Prediction Using Spatiotemporal Residual Attentive Networks

Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information

Viewport-adaptive 360-degree video coding

Spatio-Temporal Self-Attention Network for Video Saliency Prediction

On the Consensus of Synchronous Temporal and Spatial Views: A Novel Multimodal Deep Learning Method for Social Video Prediction

A Learning-Based Visual Saliency Prediction Model for Stereoscopic 3D Video (LBVS-3D)

Spherical Vision Transformer for 360-degree Video Saliency Prediction