Learning Coupled Convolutional Networks Fusion for Video Saliency Prediction

Zhe Wu, Li Su, Qingming Huang
DOI: https://doi.org/10.1109/tcsvt.2018.2870954
IF: 5.859
2019-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Visual saliency provides important information for understanding scenes in many computer vision tasks. Existing video saliency algorithms mainly focus on predicting spatial and temporal saliency maps; however, these maps are simply fused without considering the complex dynamic scenes in videos. To overcome this drawback, we propose a deep convolutional fusion framework (DCF) for video saliency prediction. The proposed model, which is based on coupled fully convolutional networks (FCNs), effectively encodes spatiotemporal information by integrating spatial and temporal features. We demonstrate that this information is helpful for accurately fusing the spatial and temporal saliency maps according to changes in video scenes. In particular, we progressively design three deep fusion architectures to investigate how to better utilize the spatiotemporal information. Moreover, we propose a reasonable sampling strategy for selecting suitable training sets for the coupled FCNs. Through extensive experiments, we demonstrate that our model outperforms state-of-the-art algorithms on four public video saliency datasets.
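
To make the contrast in the abstract concrete, the following minimal sketch compares "simple" fusion (one global weight for all pixels) with content-dependent fusion (a per-pixel weight map). It is an illustration only, not the DCF model: in the paper the fusion is learned by coupled FCNs, whereas here the weight map is hand-written. The function names (normalize, fuse_fixed, fuse_adaptive), the alpha parameter, and the toy 2x2 maps are all assumptions introduced for this example.

```python
import numpy as np


def normalize(sal):
    """Rescale a saliency map to [0, 1]; a constant map becomes all zeros."""
    sal = np.asarray(sal, dtype=np.float64)
    span = sal.max() - sal.min()
    if span == 0:
        return np.zeros_like(sal)
    return (sal - sal.min()) / span


def fuse_fixed(spatial, temporal, alpha=0.5):
    """Simple fusion: the same global weight alpha at every pixel."""
    return alpha * spatial + (1.0 - alpha) * temporal


def fuse_adaptive(spatial, temporal, weight):
    """Content-dependent fusion with a per-pixel weight map in [0, 1].

    In a learned model the weight would be predicted from spatiotemporal
    features; here it is supplied by hand (hypothetical values).
    """
    weight = np.clip(np.asarray(weight, dtype=np.float64), 0.0, 1.0)
    return weight * spatial + (1.0 - weight) * temporal


# Toy example: a static salient object (top-left) and a moving one (bottom-right).
spatial = np.array([[1.0, 0.0], [0.0, 0.0]])
temporal = np.array([[0.0, 0.0], [0.0, 1.0]])

# Hypothetical weight map: trust the spatial cue top-left, the temporal cue bottom-right.
weight = np.array([[1.0, 0.5], [0.5, 0.0]])

fixed = fuse_fixed(spatial, temporal)
adaptive = fuse_adaptive(spatial, temporal, weight)
```

With the fixed weight, both salient locations are attenuated to 0.5. With the per-pixel weight map, each location keeps the response of the cue that is reliable there, which is the kind of scene-dependent behaviour the abstract attributes to learned fusion.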