Convolution-Enhanced Vision Transformer Network for Smoke Recognition

Cheng, Guangtao
DOI: https://doi.org/10.1007/s10694-023-01378-8
IF: 3.605
2023-02-19
Fire Technology
Abstract: Visual smoke recognition remains a challenging task for two reasons: (1) smoke color, texture, brightness, and shape vary widely in complex environments; (2) smoke data are difficult to collect, so smoke datasets are scarce. The Transformer has attracted increasing interest in computer vision, but it still falls behind state-of-the-art convolutional neural networks when trained on limited datasets. To improve the visual feature representation of smoke images and to address the shortage of smoke datasets from real scenes, this paper proposes a new convolution-enhanced vision Transformer network (CViTNet) for smoke recognition that introduces desirable properties of convolutional neural networks into the vision Transformer. Instead of the straight tokenization used in the vision Transformer, we first revisit the merits of convolutional neural networks and design a convolutional token embedding that applies an overlapping strided convolution to the token feature maps, reducing feature resolution while expanding channel capacity. We then use the convolutional token embedding to partition the vision Transformer into multiple stages, constructing a hierarchical structure that enhances feature representation and reduces computational complexity. CViTNet thus enjoys the advantages of both CNNs and Transformers. Finally, we validate our approach through extensive experiments, showing that CViTNet establishes a new state-of-the-art detection accuracy that exceeds 99.54 on average with 4.49 M learnable parameters and 346 M FLOPs.
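The effect of a convolutional token embedding on feature-map size can be illustrated with a small shape calculation. The sketch below is not taken from the paper: the input size (224), the kernel, stride, and padding values, and the channel widths are all hypothetical and chosen only to show how an overlapping strided convolution (kernel larger than stride) shrinks spatial resolution while the channel count grows from stage to stage.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(height, width, stages):
    """Return (channels, height, width, num_tokens) after each embedding stage.

    Each stage is (kernel, stride, padding, out_channels). The convolution
    overlaps neighbouring windows whenever kernel > stride.
    """
    shapes = []
    for kernel, stride, padding, channels in stages:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        # Each spatial position of the feature map becomes one token.
        shapes.append((channels, height, width, height * width))
    return shapes


# Hypothetical three-stage configuration (assumed values, not the paper's).
HYPOTHETICAL_STAGES = [
    (7, 4, 2, 64),   # stage 1: 7x7 kernel, stride 4
    (3, 2, 1, 128),  # stage 2: 3x3 kernel, stride 2
    (3, 2, 1, 256),  # stage 3: 3x3 kernel, stride 2
]

shapes = stage_shapes(224, 224, HYPOTHETICAL_STAGES)
for index, (channels, h, w, tokens) in enumerate(shapes, start=1):
    print(f"stage {index}: {channels} channels, {h}x{w} map, {tokens} tokens")
```

Under these assumed settings the token count falls from 3136 to 784 to 196 across the three stages while the channel width doubles each time. Because self-attention cost grows with the square of the token count, this is the kind of trade-off a hierarchical design relies on to reduce computation.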
materials science, multidisciplinary, engineering