Hybrid CNN-ViT architecture to exploit spatio-temporal feature for fire recognition trained through transfer learning

Mohammad Shahid, Hong-Cyuan Wang, Yung-Yao Chen, Kai-Lung Hua
DOI: https://doi.org/10.1007/s11042-024-18752-5
IF: 2.577
2024-03-27
Multimedia Tools and Applications
Abstract: Fires are becoming one of the major natural hazards threatening ecology, the economy, and human life worldwide. Early fire detection systems are therefore crucial to prevent fires from spreading out of control and causing destruction. With the recent surge of interest in deep learning, many fire detection techniques based on vision sensors have emerged, which exploit the spatial features of individual images. However, fire varies in form and scale, and different combustion materials produce different colors, which makes accurate fire detection from an image challenging. Small fires captured by long-distance cameras lack salient features, further complicating detection. This paper proposes a hybrid structure that combines attention-enhanced convolutional neural networks and vision transformers (CNN-ViT) to detect fire. The proposed CNN-ViT first applies spatial attention to every frame and then aggregates temporal contextual information from neighboring frames to improve detection performance. Because fire training datasets are limited, the study employs deep transfer learning, using a pre-trained CNN for feature extraction. We used various metrics to examine the efficacy of the proposed approach. The results showed that the CNN-ViT method outperformed previous models based on spatio-temporal characteristics, achieving a relative improvement in accuracy and F1 score. Satisfactory results on images contaminated with different intensities of noise confirm the robustness of the approach.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
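
The two-stage idea described in the abstract (spatial attention within each frame, then temporal aggregation across neighboring frames) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: random arrays stand in for the feature maps of a pre-trained CNN, the spatial attention score (mean channel activation) is an assumption, the temporal stage is a single self-attention head without positional encoding rather than a full vision transformer, and all function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def spatial_attention_pool(feature_maps):
    """Pool each frame's feature map into one vector using spatial attention.

    feature_maps: array of shape (T, H, W, C), standing in for the output
    of a pre-trained CNN backbone applied to T consecutive frames.
    Returns an array of shape (T, C), one token per frame.
    """
    t, h, w, c = feature_maps.shape
    flat = feature_maps.reshape(t, h * w, c)
    # Assumption: score each spatial location by its mean channel activation.
    scores = flat.mean(axis=2)                # (T, H*W)
    weights = softmax(scores, axis=1)         # one attention map per frame
    return np.einsum("tp,tpc->tc", weights, flat)


def temporal_self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention across frame tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=1)   # (T, T)
    return attn @ v, attn


def fire_probability(feature_maps, params):
    """Return a fire / no-fire probability for one clip of T frames."""
    tokens = spatial_attention_pool(feature_maps)
    context, _ = temporal_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    clip_vector = context.mean(axis=0)        # average over frames
    logit = clip_vector @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))


# Hypothetical sizes: 8 frames, 7x7 feature maps, 32 channels, 16-dim attention.
rng = np.random.default_rng(0)
T, H, W, C, D = 8, 7, 7, 32, 16
feature_maps = rng.standard_normal((T, H, W, C))
params = {
    "w_q": rng.standard_normal((C, D)) / np.sqrt(C),
    "w_k": rng.standard_normal((C, D)) / np.sqrt(C),
    "w_v": rng.standard_normal((C, D)) / np.sqrt(C),
    "w_out": rng.standard_normal(D) / np.sqrt(D),
    "b_out": 0.0,
}
prob = fire_probability(feature_maps, params)
print(f"fire probability for the clip: {prob:.3f}")
```

With untrained random weights the printed probability is meaningless; the sketch only shows how per-frame spatial pooling feeds a temporal attention stage. In a transfer-learning setup, the backbone producing `feature_maps` would typically be kept frozen or fine-tuned while the attention and classifier weights are trained on fire data.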