Visualization Comparison of Vision Transformers and Convolutional Neural Networks

Rui Shi, Tianxing Li, Liguo Zhang, Yasushi Yamaguchi
DOI: https://doi.org/10.1109/tmm.2023.3294805
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract: Recent research has demonstrated that Vision Transformers (ViTs) are capable of comparable or even better performance than convolutional neural network (CNN) baselines. The differences in their structural designs are obvious, but our understanding of the differences in their feature representations remains limited. In this work, we propose several techniques to achieve high-quality visualization of representations in ViTs. Both qualitative and quantitative experiments show that our technical improvements can observably improve ViT visualization quality compared to previous studies. Furthermore, we conduct visualizations to explore the disparities between ViTs and CNNs pre-trained on ImageNet1K, revealing three intriguing properties of ViTs: (a) ViT feature propagation retains image detail information with minimal loss, whereas CNNs discard most image details for class discrimination. (b) Different from CNNs, object-related features do not show in ViT higher layers, suggesting that class-discriminative features may not be required for ViT classification. (c) Our visualization-assisted texture-bias experiment reveals that both ViTs and CNNs exhibit texture bias, of which ViTs seem to be more biased towards local textures.
computer science, information systems; telecommunications; software engineering