Visual-Verbal Consistency of Image Saliency

Haoran Liang, Ming Jiang, Ronghua Liang, Qi Zhao
DOI: https://doi.org/10.1109/SMC.2017.8123171
2017-01-01
Abstract: When looking at an image, humans shift their attention towards interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem, we look into eye fixations and image captions, two types of subjective annotations that are relatively task-free and natural. From the annotations, we extract visual and verbal saliency ranks to compare against each other. We then propose a number of low-level and semantic-level features relevant to the visual-verbal consistency. Integrated into a computational model, the proposed features effectively predict the consistency between the two modalities on a large dataset with both types of annotations, namely SALICON [1].
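To make the idea of comparing visual and verbal saliency ranks concrete, the following is a minimal sketch and not the authors' method. It assumes that each object in an image has a fixation-based score and a caption-based score, and it uses Spearman rank correlation as the consistency measure; the abstract does not name the statistic, so that choice is an assumption. The object names, scores, and the helper names `rank_scores` and `spearman` are all hypothetical.

```python
def rank_scores(scores):
    """Map each object to a 1-based rank (1 = most salient).

    Tied scores share the average of the ranks they span.
    """
    order = sorted(scores, key=lambda name: -scores[name])
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of objects tied with order[i].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name in order[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def spearman(rank_a, rank_b):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    names = sorted(rank_a)
    a = [rank_a[n] for n in names]
    b = [rank_b[n] for n in names]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # Undefined when one ranking is constant; report 0.
    return cov / (var_a * var_b) ** 0.5


# Hypothetical toy data: fixation counts per object (visual saliency)
# and number of captions mentioning each object (verbal saliency).
visual_scores = {"person": 12, "dog": 7, "frisbee": 3, "tree": 1}
verbal_scores = {"person": 5, "dog": 4, "frisbee": 4, "tree": 0}

visual_ranks = rank_scores(visual_scores)
verbal_ranks = rank_scores(verbal_scores)
rho = spearman(visual_ranks, verbal_ranks)
print(f"visual-verbal rank consistency: {rho:.3f}")
```

A value near 1 would indicate that objects people fixate most are also the ones they mention most; a value near 0 or below would indicate disagreement between the two modalities.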