Abstract: Our visual system can categorize images of natural visual scenes at remarkable speed. Some research has suggested that this ability is based on a fast feedforward sweep of visuomotor processing, because top-down signals that reflect strategic processing or attentional effects are too slow to exert any influence. However, recent studies have provided new evidence that some local recurrent feedback connections in early visual areas, which differ from the slower attention-mediated processes, might also affect rapid scene recognition. In addition to attention-mediated processes, expectations are known to greatly affect our experience of the world in a top-down way. In ambiguous situations, such as under rapid serial visual presentation, knowledge of the world guides our interpretation of sensory information and helps us recognize the scene quickly and accurately. In the present study, we investigated whether contextual expectation can affect rapid scene recognition, and the mechanisms behind those processes at different stages of the early visual areas.

Experiment 1 used a binocular rivalry paradigm. On each trial, we first presented a sequence of identical images to the two eyes in order to generate an expectation (animal/non-animal) about the next scene in the series. We followed this predictive sequence with a rivalry display in which the predicted category of scene was presented (36 ms) to one eye and a non-predicted category was presented to the other eye. Three factors were manipulated: the number of predictive sequence images (between 0 and 12), the predicted category (animal/non-animal), and the eye to which the "matching" rivalrous category was presented (left/right eye). Experiment 2 used a dual-task paradigm, in which participants were asked to perform a central word discrimination task and a peripheral natural scene categorization task at the same time.
In the central task, two words selected from the categories of animals, plants, or office equipment were displayed in the center, and participants had to judge whether they came from the same category. In the peripheral task, natural scene images that did or did not contain animals were flashed for 36 ms at one of the four corners, chosen at random, and participants were instructed to respond to images containing one or more animals. Experiment 3 consisted of Experiments 3a and 3b, whose procedures were the same as in Experiment 2 except that they used the lower spatial frequency (3a, LSF) and higher spatial frequency (3b, HSF) components of the original scenes, respectively.

The results of Experiment 1 showed that observers were more likely to perceive the predicted category of scene at the onset of rivalry, suggesting that expectation can bias the subjective perception of incoming sensory information from a rapidly presented scene. The results of Experiment 2 showed that scene recognition performance in the single-task condition was consistently better than in the dual-task condition (t(19) = 4.65, p < 0.001, Cohen's d = 1.02). They also showed that d' was larger in the expected condition than in the unexpected condition (t(19) = 5.07, p < 0.001, Cohen's d = 0.91), as was β (t(19) = 3.02, p < 0.05, Cohen's d = 0.51). These results suggest that expectations generated by visual words can influence both the bias and the discriminability of observers detecting animals in rapidly presented natural scene images. LSF scene recognition in Experiment 3a was significantly better than HSF scene recognition in Experiment 3b (t(19) = 3.26, p < 0.05, Cohen's d = 1.07), consistent with previous research showing that a coarse-to-fine process could account for efficient scene recognition (Musel et al., 2014).
Although d' differed significantly between the expected and unexpected conditions in the dual task in both Experiment 3a (t(19) = 4.82, p < 0.01, Cohen's d = 0.75) and Experiment 3b (t(19) = 6.28, p < 0.001, Cohen's d = 1.32), β differed significantly only under the HSF condition (Exp. 3b: t(19) = 3.54, p < 0.05, Cohen's d = 0.79; Exp. 3a: t(19) = 2.08, p > 0.05), suggesting that although the LSF components of a scene can be processed more efficiently at early stages of the visual areas, the HSF components are essential for rapid natural scene recognition.

The results of the current research provide solid support for theories that our visual system combines stimulus-driven feedforward signals with feedback information from prior expectations to accomplish rapid natural scene recognition. The influence of contextual expectations on rapid natural scene recognition differs between spatial frequency components processed by separate visual pathways, and both components are essential for recognizing a rapidly presented natural scene.
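
The abstract reports discriminability (d') and bias (β) but does not say how they were computed. As an illustration only, the sketch below assumes the standard equal-variance Gaussian signal detection model for a yes/no animal-detection task. The function name `sdt_measures`, its arguments, and the log-linear correction for extreme rates are assumptions of this sketch, not details taken from the study.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, beta) from the four response counts of a yes/no task.

    Hypothetical helper: assumes the equal-variance Gaussian model.
    """
    # Log-linear correction (an assumption of this sketch): add 0.5 to each
    # cell so that rates of exactly 0 or 1 do not give infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)

    # Discriminability: distance between signal and noise distributions.
    d_prime = z_hit - z_fa
    # Bias: likelihood ratio of signal to noise densities at the criterion,
    # which simplifies to exp((z_fa^2 - z_hit^2) / 2).
    beta = math.exp((z_fa ** 2 - z_hit ** 2) / 2.0)
    return d_prime, beta


if __name__ == "__main__":
    # Made-up counts for one observer in one condition (not data from the study).
    d_prime, beta = sdt_measures(hits=40, misses=10,
                                 false_alarms=10, correct_rejections=40)
    print(f"d' = {d_prime:.2f}, beta = {beta:.2f}")
```

Under this model, an observer whose miss rate equals the false-alarm rate has β of approximately 1 (no bias), while β above 1 indicates a conservative tendency to report "no animal". Comparing d' and β between expected and unexpected trials, as the abstract describes, would mean computing the pair once per participant and condition.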

Putting visual object recognition in context

Learning Visual Context for Group Activity Recognition.

The Impact of Contextual Expectation on Rapid Natural Scene Recognition

The Impact of Scene Context on Visual Object Recognition: Comparing Humans, Monkeys, and Computational Models

The role of context in object recognition

A Context Model for Object Recognition Improved by Spatial Relationships

When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes

How to make face recognition work: The power of modeling context

Lost in Context: The Influence of Context on Feature Attribution Methods for Object Recognition

Connectivity-Inspired Network for Context-Aware Recognition

Context-LGM: Leveraging Object-Context Relation for Context-Aware Object Recognition

Object Recognition Based on Improved Context Model

Object Recognition Using Local Context Information

Context Matters: Distilling Knowledge Graph for Enhanced Object Detection

Towards Context-Aware Interaction Recognition for Visual Relationship Detection

Context understanding in computer vision: A survey

Multi-Modal Subjective Context Modelling and Recognition

Contextual associations represented both in neural networks and human behavior

Towards Context-aware Interaction Recognition.

Quantifying and Transferring Contextual Information in Object Detection

Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection