CrackViT: a unified CNN-transformer model for pixel-level crack extraction

Jianing Quan, Baozhen Ge, Min Wang
DOI: https://doi.org/10.1007/s00521-023-08277-7
2023-01-31
Neural Computing and Applications
Abstract: Pixel-level crack extraction (PCE) is challenging due to complex topology, irregular edges, low contrast, and complex backgrounds. Recently, Transformer architectures have shown great potential on many vision tasks and have even outperformed convolutional neural networks (CNNs). Benefiting from the self-attention mechanism, Transformers can invariably capture global context information to establish long-range dependencies across the detected objects. However, there has been little work on Transformer architectures for PCE. In this paper, we present, for the first time, a systematic analysis of three well-designed Transformer architectures for the PCE task in terms of network structures and parameters, feature fusion modes, training data and strategy, and generalization ability. We propose a Crack extraction network with Vision Transformer (CrackViT) that jointly captures detailed structures and long-distance dependencies through a novel hybrid CNN-Transformer encoder, preserving the corresponding crack topologies. To better suit the PCE task, we explore three feature fusion modes between the CNN and the Transformer. In addition, a novel feature aggregation block is proposed to sharpen edges during decoder upsampling and to reduce the noise introduced by shallow features. Moreover, a multi-task supervised training strategy is adopted to further improve the details of crack edges. Results on four challenging datasets (CrackForest, DeepCrack, CRKWH100, and CRACK500) show that CrackViT outperforms state-of-the-art CNN-based methods and the other two novel Transformer architectures. Our code is available at: https://github.com/SmilQe/CrackViT.
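The abstract's claim that self-attention establishes long-range dependencies can be illustrated with a minimal sketch. This is not the CrackViT implementation: it is a generic single-head scaled dot-product self-attention in plain Python, with the learned query, key, and value projections omitted (an assumption made for brevity). The `patches` values are hypothetical two-dimensional patch features, invented only to show that two similar patches at opposite ends of a sequence attend to each other regardless of distance.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each token serves directly
    as its own query, key, and value. Real Transformers learn these maps.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Every query is scored against every key: a global receptive field.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical patch features: first and last patches are "crack-like",
# the two in between are "background".
patches = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
outputs, weights = self_attention(patches)
```

In this toy input, the first patch gives the distant last patch the same weight as itself and more weight than its immediate neighbor, which is the kind of distance-independent interaction that a convolution with a small kernel cannot provide in a single layer.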
computer science, artificial intelligence