Abstract: The human visual system can categorize images of natural scenes at remarkable speed. Some studies have suggested that this ability relies on a fast feedforward sweep of visuomotor processing, because top-down signals that reflect strategic processing or attentional effects are too slow to exert any influence. Recent studies, however, have provided evidence that local recurrent feedback connections in early visual areas, which differ from the slower attention-mediated processes, might also affect rapid scene recognition. Besides attention-mediated processes, expectations are known to strongly shape our experience of the world in a top-down way. In ambiguous situations, such as rapid serial visual presentation, knowledge of the world guides our interpretation of sensory information and helps us recognize a scene quickly and accurately. In the present study, we investigated whether contextual expectation affects rapid scene recognition, and the mechanisms behind this influence at different stages of processing in early visual areas.

Experiment 1 used a binocular rivalry paradigm. On each trial, we first presented a sequence of identical images to the two eyes in order to generate an expectation (animal/non-animal) about the next scene in the series. This predictive sequence was followed by a rivalry display in which a scene of the predicted category was presented (36 ms) to one eye and a scene of a non-predicted category was presented to the other eye. Three factors were manipulated: the number of images in the predictive sequence (between 0 and 12), the predicted category (animal/non-animal), and the eye to which the "matching" rivalrous category was presented (left/right eye). Experiment 2 used a dual-task paradigm, in which participants performed a central word discrimination task and a peripheral natural scene categorization task at the same time.
In the central task, two words drawn from the categories of animals, plants, or office equipment were displayed at the center of the screen, and participants had to judge whether they came from the same category. In the peripheral task, natural scene images that did or did not contain animals were flashed for 36 ms at one of the four corners, chosen at random, and participants were instructed to respond to images containing one or more animals. Experiment 3 consisted of Experiments 3a and 3b, whose procedures were the same as in Experiment 2 except that they used the lower spatial frequency (LSF; 3a) and higher spatial frequency (HSF; 3b) components of the original scenes, respectively.

The results of Experiment 1 showed that observers were more likely to perceive the predicted category of scene at the onset of rivalry, suggesting that expectation can bias the subjective perception of incoming sensory information from a rapidly presented scene. In Experiment 2, scene recognition performance was consistently better in the single-task condition than in the dual-task condition (t(19) = 4.65, p < 0.001, Cohen's d = 1.02). d' was larger in the expected condition than in the unexpected condition (t(19) = 5.07, p < 0.001, Cohen's d = 0.91), as was β (t(19) = 3.02, p < 0.05, Cohen's d = 0.51). These results suggest that expectations generated by visual words can influence both the response bias and the discriminability of observers detecting animals in rapidly presented natural scene images. Recognition of LSF scenes in Experiment 3a was significantly better than recognition of HSF scenes in Experiment 3b (t(19) = 3.26, p < 0.05, Cohen's d = 1.07), consistent with previous findings that a coarse-to-fine process can account for efficient scene recognition (Musel et al., 2014).
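
The sensitivity (d') and response bias (β) measures reported for the animal detection task come from signal detection theory. The following is a minimal sketch of how they are commonly computed from trial counts. The function name `sdt_measures` and the log-linear correction are assumptions made for illustration; the abstract does not state which correction, if any, the authors applied.

```python
from math import exp
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, beta) for a yes/no detection task.

    "Signal" trials are scenes containing an animal; "noise" trials are
    scenes without one. This is an illustrative helper, not the authors' code.
    """
    # Log-linear correction (an assumption): add 0.5 to each count so that
    # hit or false-alarm rates of exactly 0 or 1 still give finite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(fa_rate)

    # Discriminability: distance between signal and noise distributions.
    d_prime = z_hit - z_fa
    # Bias as a likelihood ratio at the criterion: phi(z_hit) / phi(z_fa).
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2.0)
    return d_prime, beta


# Hypothetical counts for one observer in one condition.
d_prime, beta = sdt_measures(hits=30, misses=20, false_alarms=5, correct_rejections=45)
print(f"d' = {d_prime:.2f}, beta = {beta:.2f}")
```

Under this definition, β above 1 indicates a conservative criterion (reluctance to report an animal) and β below 1 a liberal one, while d' is independent of where the criterion sits.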
Although d' differed significantly between the expected and unexpected conditions in the dual task in both Experiment 3a (t(19) = 4.82, p < 0.01, Cohen's d = 0.75) and Experiment 3b (t(19) = 6.28, p < 0.001, Cohen's d = 1.32), β differed significantly only in the HSF condition (Exp. 3b: t(19) = 3.54, p < 0.05, Cohen's d = 0.79; Exp. 3a: t(19) = 2.08, p > 0.05). This suggests that although the LSF components of a scene can be processed more efficiently at early stages of the visual system, the HSF components are essential for rapid natural scene recognition.

These results support theories that the visual system combines stimulus-driven feedforward signals with feedback information carrying prior expectations to accomplish rapid natural scene recognition. The influence of contextual expectation on rapid natural scene recognition differs between spatial frequency components processed by separate visual pathways, and both components are essential for recognizing a rapidly presented natural scene.
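
As an illustration of the LSF and HSF stimuli used in Experiment 3, the sketch below splits a grayscale image into a low-pass component and its high-pass residual. The Gaussian filter, the `sigma` value, and all function names are assumptions chosen for the example; the abstract does not report the filter type or cutoff frequencies the authors used.

```python
from math import exp


def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        n = len(row)
        blurred.append([
            sum(k * row[min(max(x + j - radius, 0), n - 1)] for j, k in enumerate(kernel))
            for x in range(n)
        ])
    return blurred


def transpose(image):
    return [list(column) for column in zip(*image)]


def split_spatial_frequencies(image, sigma=2.0):
    """Return (lsf, hsf) such that lsf + hsf reproduces the input image.

    `image` is a list of rows of grayscale values. `sigma` (in pixels) is a
    hypothetical cutoff control: larger values keep only coarser structure
    in the LSF component.
    """
    kernel = gaussian_kernel(sigma)
    # Separable Gaussian blur: filter along rows, then along columns.
    lsf = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    # The high-frequency component is what the blur removed.
    hsf = [[p - l for p, l in zip(row, lsf_row)] for row, lsf_row in zip(image, lsf)]
    return lsf, hsf


# A fine checkerboard is almost entirely high-frequency content.
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
lsf, hsf = split_spatial_frequencies(checker, sigma=1.5)
```

The LSF output keeps the coarse layout of a scene, while the HSF output keeps edges and fine detail, which is the distinction the coarse-to-fine account relies on.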

The Impact of Contextual Expectation on Rapid Natural Scene Recognition
