Text to Region: Visual-Word Guided Saliency Detection.

Tengfei Xing, Zhaohui Wang, Jianyu Yang, Yi Ji, Chunping Liu
DOI: https://doi.org/10.1007/978-3-030-00764-5_68
2018-01-01
Abstract: Neural-network-based image and video captioning can generate accurate descriptions, but how visual information is converted into a natural language representation remains an open question. Existing caption-guided saliency methods take the entire sentence as input to generate a saliency map, which exposes the region-to-word mapping. However, not every word in a caption relates to visual information. We remove meaningless stop words such as 'the' and 'of' to avoid misleading results. We also use MFB (Multi-modal Factorized Bilinear Pooling) to fuse C3D features, which provides richer spatiotemporal information for exposing visual-word guided saliency. Such a system produces better spatiotemporal heatmaps for both predicted captions and arbitrary query sentences without introducing attentional layers. Experimental results on the MSR-VTT and Flickr30K datasets surpass the state of the art by a significant margin.
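The two ingredients named in the abstract, stop-word removal and MFB fusion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the stop-word list, the function names `visual_words` and `mfb_fuse`, and the choice of inputs `x` and `y` are assumptions made for illustration, since the abstract does not state which features are fused with the C3D features or which stop-word list is used. The pooling step follows the general MFB formulation (project both inputs, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization).

```python
import numpy as np

# Hypothetical stop-word list; the list used in the paper is not given in the abstract.
STOP_WORDS = {"a", "an", "the", "of", "is", "in", "on", "and", "to"}


def visual_words(caption):
    """Return (position, word) pairs for caption words that are not stop words."""
    tokens = caption.lower().split()
    return [(i, w) for i, w in enumerate(tokens) if w not in STOP_WORDS]


def mfb_fuse(x, y, U, V, k):
    """Fuse two feature vectors with Multi-modal Factorized Bilinear pooling.

    x: (m,) and y: (n,) are the two input features (placeholders here).
    U: (m, k*o) and V: (n, k*o) are projection matrices.
    Returns a fused vector of shape (o,).
    """
    joint = (U.T @ x) * (V.T @ y)          # element-wise product, shape (k*o,)
    z = joint.reshape(-1, k).sum(axis=1)   # sum-pool over non-overlapping windows of size k
    z = np.sign(z) * np.sqrt(np.abs(z))    # power (signed square-root) normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z     # L2 normalization


if __name__ == "__main__":
    print(visual_words("The dog runs on the grass"))
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=8), rng.normal(size=6)   # stand-ins for real features
    U, V = rng.normal(size=(8, 8)), rng.normal(size=(6, 8))
    print(mfb_fuse(x, y, U, V, k=2).shape)
```

In a caption-guided saliency setting, only the positions returned by `visual_words` would be used to query the saliency map, so that words like 'the' never produce a heatmap.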