Convolutional architectures are cortex-aligned de novo

Atlas Kazemian, Eric Elmoznino, Michael F. Bonner
DOI: https://doi.org/10.1101/2024.05.10.593623
2024-05-15
Abstract: What underlies the emergence of cortex-aligned representations in deep neural network models of vision? The success of widely varied architectures has motivated the prevailing hypothesis that large-scale pre-training is the primary factor underlying the similarities between brains and neural networks. Here, we challenge this view by revealing the role of architectural inductive biases in models with minimal training. We examined networks with varied architectures but no pre-training and quantified their ability to predict image representations in the visual cortices of both monkeys and humans. We found that cortex-aligned representations emerge in convolutional architectures that combine two key manipulations of dimensionality: compression in the spatial domain and expansion in the feature domain. We further show that the inductive biases of convolutional architectures are critical for obtaining performance gains from feature expansion - dimensionality manipulations were relatively ineffective in other architectures and in convolutional models with targeted lesions. Our findings suggest that the architectural constraints of convolutional networks are sufficiently close to the constraints of biological vision to allow many aspects of cortical visual representation to emerge even before synaptic connections have been tuned through experience.
Animal Behavior and Cognition
What problem does this paper attempt to address?
This paper asks why convolutional architectures in deep neural networks (DNNs) can produce representations resembling those of the visual cortex without extensive pre-training. The researchers challenge the widely held view that large-scale pre-training is the main factor behind the similarity between DNNs and the brain. They examined networks with different architectures but no pre-training and quantified how well each predicts image representations in the visual cortices of monkeys and humans. They found that cortex-aligned representations emerge in convolutional architectures that combine two key manipulations of dimensionality: compression in the spatial domain and expansion in the feature domain. The inductive biases of convolutional architectures proved critical for benefiting from feature expansion: the same dimensionality manipulations were relatively ineffective in other architectures and in convolutional models with targeted lesions. These findings suggest that the architectural constraints of convolutional networks are close enough to those of biological vision for many aspects of cortical visual representation to emerge even before synaptic connections have been tuned through experience.
To show this, the researchers increased the number of random features in untrained convolutional, fully connected, and Transformer architectures and measured how encoding performance changed, that is, how well each network's features predict image responses in the visual cortices of monkeys and humans. Although all architectures benefited from dimension expansion, the gains for convolutional architectures were significantly greater than for the others, even in dimension-matched comparisons. The paper also identified nonlinear activation functions and the spatial locality of convolutional filters as critical components: encoding performance dropped significantly when either was removed.
In summary, the study emphasizes the importance of the architectural biases of convolutional networks in forming representations similar to those of the visual cortex, even without extensive training. Pre-training may be sufficient to induce brain-aligned representations in a variety of architectures, but the initial state of convolutional architectures already exhibits a considerable degree of brain alignment.
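The evaluation idea summarized above (untrained convolutional features, expanded in the feature domain and compressed in the spatial domain, then linearly mapped to cortical responses) can be illustrated with a minimal sketch. This is not the authors' model or pipeline: the single-layer design, the function names (`random_conv_features`, `ridge_fit_predict`, `encoding_score`, `run_demo`), the ridge penalty, and all sizes are assumptions chosen for illustration, and the "images" and "neural responses" are synthetic placeholders, so the scores it produces say nothing about real brain data.

```python
import numpy as np


def random_conv_features(images, n_filters, kernel_size=5, pool=4, seed=0):
    """Untrained conv layer: random local filters -> ReLU -> average pooling.

    images: array of shape (n_images, height, width), single channel.
    Returns an array of shape (n_images, pooled_h * pooled_w * n_filters).
    """
    rng = np.random.default_rng(seed)
    n = images.shape[0]
    # Feature expansion: more random filters means more output channels.
    filters = rng.standard_normal((n_filters, kernel_size * kernel_size))
    # Spatial locality and weight sharing: every filter sees only small
    # patches, and the same filter is applied at every position.
    patches = np.lib.stride_tricks.sliding_window_view(
        images, (kernel_size, kernel_size), axis=(1, 2)
    )
    out_h, out_w = patches.shape[1], patches.shape[2]
    patches = patches.reshape(n, out_h, out_w, -1)
    maps = np.maximum(patches @ filters.T, 0.0)  # ReLU nonlinearity
    # Spatial compression: non-overlapping average pooling.
    ph, pw = out_h // pool, out_w // pool
    maps = maps[:, : ph * pool, : pw * pool, :]
    pooled = maps.reshape(n, ph, pool, pw, pool, n_filters).mean(axis=(2, 4))
    return pooled.reshape(n, -1)


def ridge_fit_predict(x_train, y_train, x_test, alpha=1.0):
    """Ridge regression in dual form (cheap when features outnumber images)."""
    mu_x, mu_y = x_train.mean(axis=0), y_train.mean(axis=0)
    xc = x_train - mu_x
    gram = xc @ xc.T
    dual = np.linalg.solve(gram + alpha * np.eye(gram.shape[0]), y_train - mu_y)
    weights = xc.T @ dual
    return (x_test - mu_x) @ weights + mu_y


def encoding_score(y_pred, y_true):
    """Mean Pearson correlation across response channels (held-out data)."""
    yp = y_pred - y_pred.mean(axis=0)
    yt = y_true - y_true.mean(axis=0)
    denom = np.linalg.norm(yp, axis=0) * np.linalg.norm(yt, axis=0) + 1e-12
    return float(((yp * yt).sum(axis=0) / denom).mean())


def run_demo(seed=0):
    """Score the same untrained layer at several feature-expansion levels."""
    rng = np.random.default_rng(seed)
    images = rng.standard_normal((120, 16, 16))  # placeholder "stimuli"
    # Placeholder "neural responses": a noisy linear readout of the pixels.
    readout = rng.standard_normal((16 * 16, 10))
    responses = images.reshape(120, -1) @ readout
    responses = responses + rng.standard_normal((120, 10))
    train, test = slice(0, 90), slice(90, 120)
    scores = {}
    for n_filters in (4, 32, 256):
        feats = random_conv_features(images, n_filters)
        pred = ridge_fit_predict(feats[train], responses[train], feats[test])
        scores[n_filters] = encoding_score(pred, responses[test])
    return scores


if __name__ == "__main__":
    for n_filters, score in run_demo().items():
        print(f"{n_filters:4d} random filters -> held-out score {score:.3f}")
```

In a sketch like this, the lesions described in the summary would correspond to dropping the `np.maximum` step (removing the nonlinearity) or replacing the local patches with full-image random projections (removing spatial locality); with real stimuli and recordings, each variant would be scored in the same way.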