Abstract: Humans actively observe the visual surroundings by focusing on salient objects and ignoring trivial details. However, computer vision models based on convolutional neural networks (CNN) often analyze visual input all at once through a single feed-forward pass. In this study, we designed a dual-stream vision model inspired by the human brain. This model features retina-like input layers and includes two streams: one determining the next point of focus (the fixation), while the other interprets the visuals surrounding the fixation. Trained on image recognition, this model examines an image through a sequence of fixations, each time focusing on different parts, thereby progressively building a representation of the image. We evaluated this model against various benchmarks in terms of object recognition, gaze behavior and adversarial robustness. Our findings suggest that the model can attend and gaze in ways similar to humans without being explicitly trained to mimic human attention, and that the model can enhance robustness against adversarial attacks due to its retinal sampling and recurrent processing. In particular, the model can correct its perceptual errors by taking more glances, setting itself apart from all feed-forward-only models. In conclusion, the interactions of retinal sampling, eye movement, and recurrent dynamics are important to human-like visual exploration and inference.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the robustness of computer vision models, particularly those based on convolutional neural networks (CNNs), against adversarial noise, and attempts to improve that robustness by mimicking the human visual system.

**Specific objectives include:**

1. **Mimicking the Attention Mechanism of the Human Visual System:**
   - Unlike conventional CNNs, which analyze the whole input in a single feed-forward pass, the human visual system actively focuses on salient regions while ignoring unimportant details. The researchers therefore designed a dual-stream vision model with a retina-like input layer. One stream determines the next fixation point; the other interprets the visual information around that fixation point.
2. **Improving Adversarial Robustness:**
   - The study found that this dual-stream model fixates and attends in a human-like manner without being explicitly trained to mimic human attention. In addition, its retinal sampling and recurrent processing can enhance robustness against adversarial attacks. Specifically, the model can correct perceptual errors by taking more fixations, which distinguishes it from all purely feed-forward models.
3. **Validating Model Performance:**
   - The study evaluated the model on multiple benchmarks covering object recognition, gaze behavior, and adversarial robustness. The results suggest that the model attends and gazes in ways similar to humans and is more robust against adversarial attacks.

Through these methods, the researchers hope to develop a computer vision model that more closely resembles the human visual system, thereby improving its robustness and accuracy in real-world applications.
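
The fixation loop summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`retinal_sample`, `choose_fixation`, `explore`), the ring-based sampling pattern, the brightest-pixel rule standing in for the fixation stream, and the running-mean update standing in for recurrent processing are all assumptions made for illustration. In the paper both streams are trained neural networks; the sketch only shows the control flow of sampling densely near a fixation, updating a state, and choosing where to look next.

```python
import math


def retinal_sample(image, fixation, n_rings=3, n_angles=8):
    """Sample pixel values on rings around the fixation (hypothetical scheme).

    Ring radii double with each ring (1, 2, 4, ...), so sampling is dense near
    the fixation and sparse in the periphery. Out-of-bounds samples read as 0.
    """
    h, w = len(image), len(image[0])
    fy, fx = fixation
    samples = [image[fy][fx]]  # the foveal center itself
    for ring in range(n_rings):
        radius = 2 ** ring
        for k in range(n_angles):
            angle = 2 * math.pi * k / n_angles
            y = fy + round(radius * math.sin(angle))
            x = fx + round(radius * math.cos(angle))
            inside = 0 <= y < h and 0 <= x < w
            samples.append(image[y][x] if inside else 0.0)
    return samples


def choose_fixation(image, visited):
    """Placeholder for the fixation stream: brightest unvisited pixel."""
    best, best_val = None, -math.inf
    for y, row in enumerate(image):
        for x, val in enumerate(row):
            if (y, x) not in visited and val > best_val:
                best, best_val = (y, x), val
    return best


def explore(image, n_fixations=3):
    """Run a sequence of fixations and accumulate a running representation."""
    visited, state = [], None
    for step in range(1, n_fixations + 1):
        fixation = choose_fixation(image, visited)
        glimpse = retinal_sample(image, fixation)
        if state is None:
            state = glimpse
        else:
            # Placeholder recurrent update: running mean over all glimpses.
            state = [s + (g - s) / step for s, g in zip(state, glimpse)]
        visited.append(fixation)
    return visited, state


# Toy 5x5 "image" with two bright spots.
image = [[0.0] * 5 for _ in range(5)]
image[1][1] = 1.0
image[3][3] = 0.5
visited, state = explore(image, n_fixations=2)
print(visited)
```

Each additional fixation refines `state`, which loosely mirrors how the model described above can revise its percept by taking more glances. A single feed-forward pass has no equivalent step.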

Human Eyes Inspired Recurrent Neural Networks are More Robust Against Adversarial Noises