Learning high-level concepts by training a deep network on eye fixations

Chengyao Shen, Mingli Song, Qi Zhao
2012-01-01
Abstract: Visual attention is the ability to select, from among many competing visual stimuli, those that are most behaviorally relevant. It allows us to allocate our limited processing resources to the most informative parts of the visual scene. In this paper, we learn general high-level concepts with the aid of selective attention in a principled unsupervised framework: a three-layer deep network is built, and greedy layerwise training is applied to learn mid- and high-level features from salient regions of images. The network is shown to learn meaningful high-level concepts, such as faces and text, in the third layer, and mid-level features, such as junctions, textures, and parallelism, in the second layer. Unlike the pretrained object detectors that have recently been incorporated into saliency models to predict semantic objects, the high-level features we learn are general base features that are not restricted to one or a few object categories. A saliency model built on the learned features achieves predictive power in natural scenes that is competitive with existing methods.
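The abstract names greedy layerwise training on salient regions but does not state the per-layer learner or how salient regions are selected. The following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes tied-weight sigmoid autoencoders as the per-layer learner, patches drawn with probability proportional to a non-negative fixation density map (`fixation_map`), and the hypothetical helper names `sample_salient_patches`, `train_autoencoder`, `reconstruction_error`, and `greedy_layerwise`. "Greedy layerwise" here means each layer is trained on the frozen outputs of the layer below it.

```python
import numpy as np

# Hypothetical helpers: the per-layer learner (tied-weight autoencoder), patch size,
# learning rate, and epoch count are illustrative assumptions, not values from the paper.


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample_salient_patches(image, fixation_map, patch=8, n=200, rng=None):
    """Sample n flattened patch x patch crops from a 2-D image, choosing each crop
    with probability proportional to the fixation density at its center."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = image.shape
    half = patch // 2
    # Weight of each valid top-left corner = fixation density at the crop center.
    weights = fixation_map[half:half + h - patch + 1, half:half + w - patch + 1]
    p = weights.ravel() / weights.sum()
    idx = rng.choice(p.size, size=n, p=p)
    ys, xs = np.unravel_index(idx, weights.shape)
    return np.stack([image[y:y + patch, x:x + patch].ravel() for y, x in zip(ys, xs)])


def train_autoencoder(X, n_hidden, lr=0.1, epochs=50, rng=None):
    """One tied-weight autoencoder (sigmoid encoder, linear decoder) trained by
    full-batch gradient descent on 0.5 * mean squared reconstruction error."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    W = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)          # encode
        R = H @ W.T + b_v                 # decode with the transposed (tied) weights
        dR = (R - X) / n                  # gradient of the loss w.r.t. R
        dZ = (dR @ W) * H * (1.0 - H)     # back-propagate through the sigmoid
        W -= lr * (X.T @ dZ + dR.T @ H)   # encoder + decoder contributions to dW
        b_h -= lr * dZ.sum(axis=0)
        b_v -= lr * dR.sum(axis=0)
    return W, b_h, b_v


def reconstruction_error(X, W, b_h, b_v):
    """0.5 * mean squared reconstruction error of one autoencoder."""
    H = sigmoid(X @ W + b_h)
    return 0.5 * np.mean(np.sum((H @ W.T + b_v - X) ** 2, axis=1))


def greedy_layerwise(X, layer_sizes, **kwargs):
    """Train layers one at a time; each layer sees the frozen codes of the one below."""
    layers, inp = [], X
    for size in layer_sizes:
        W, b_h, _ = train_autoencoder(inp, size, **kwargs)
        layers.append((W, b_h))
        inp = sigmoid(inp @ W + b_h)      # codes fed to the next layer
    return layers, inp
```

For a three-layer stack, one would call something like `greedy_layerwise(X, [64, 32, 16])` on patches from `sample_salient_patches`; these sizes are placeholders, since the abstract does not report layer widths.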
What problem does this paper attempt to address?