FCT: Fusing CNN and Transformer for Scene Classification

Yuxiang Xie, Jie Yan, Lai Kang, Yanming Guo, Jiahui Zhang, Xidao Luan
DOI: https://doi.org/10.1007/s13735-022-00252-7
2022-01-01
International Journal of Multimedia Information Retrieval
Abstract: Scene classification based on convolutional neural networks (CNNs) has achieved great success in recent years. In CNNs, the convolution operation performs well at extracting local features, but its ability to capture global feature representations is limited. In the vision transformer (ViT), the self-attention mechanism can capture long-range feature dependencies, but it tends to lose the details of local features. In this work, we exploit the complementary strengths of the CNN and ViT and propose a Transformer-based framework that incorporates a CNN to improve the discriminative ability of features for scene classification. Specifically, we take the deep convolutional feature as the input and build a scene Transformer module to extract the global feature of the scene image. An end-to-end scene classification framework, called FCT, is built by fusing the CNN and the scene Transformer module. Experimental results show that FCT achieves new state-of-the-art performance on two standard benchmarks, MIT Indoor 67 and SUN 397, with accuracies of 90.75% and 77.50%, respectively.
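To make the data flow described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the deep convolutional feature is a C x H x W map, treats each spatial position as a token, applies a single-head self-attention step with identity projections (no learned weights), and fuses the pooled CNN feature with the pooled attention output by concatenation. The function names, the 8 x 4 x 4 shape, the random stand-in feature map, and the concatenation rule are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_feature_map(fmap):
    """Turn a C x H x W feature map (nested lists) into H*W tokens of length C."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Queries, keys and values are the tokens themselves (identity
    projections), an illustrative simplification with no learned weights.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def mean_pool(tokens):
    """Average a list of tokens into a single vector."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def fuse(fmap):
    """Concatenate the pooled CNN feature with the pooled attention output.

    Concatenation is an assumed fusion rule, not taken from the paper.
    """
    tokens = flatten_feature_map(fmap)
    local_feat = mean_pool(tokens)                   # CNN branch
    global_feat = mean_pool(self_attention(tokens))  # attention branch
    return local_feat + global_feat


random.seed(0)
# Stand-in for a deep convolutional feature map: 8 channels on a 4 x 4 grid.
fmap = [[[random.random() for _ in range(4)] for _ in range(4)]
        for _ in range(8)]
fused = fuse(fmap)
print(len(fused))  # 16: 8 pooled CNN channels + 8 attention channels
```

In a full classifier, a vector like `fused` would feed a linear layer with one output per scene category (67 for MIT Indoor 67, 397 for SUN 397); that classification head is likewise an assumption here.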