Abstract: Many studies have been done to detect smoke from satellite imagery. However, these prior methods are still not effective in detecting various kinds of smoke in complex backgrounds. Smoke presents challenges in detection due to variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by proposing a new segmentation model called VTrUNet, which consists of a virtual band construction module to capture spectral patterns and a transformer-boosted UNet to capture long-range contextual features. The model takes imagery of six bands as input: red, green, blue, near infrared, and two shortwave infrared bands. To show the advantages of the proposed model, the paper presents extensive results for various possible model architectures improving UNet and draws interesting conclusions, including that adding more modules to a model does not always lead to better performance. The paper also compares the proposed model with very recently proposed and related models for smoke segmentation and shows that the proposed model performs the best and makes significant improvements in prediction performance.

What problem does this paper attempt to address?

The paper addresses the problem that existing smoke segmentation methods cannot effectively detect all types of smoke in multispectral LandSat satellite images with complex backgrounds. Specifically, smoke detection faces the following challenges:

1. **Changes in density, color, and lighting conditions**: These characteristics of smoke make it difficult to distinguish from the background.
2. **Background interference**: Background factors such as clouds and haze increase the difficulty of detection.
3. **Contextual nature of thin smoke**: Thin smoke may appear as either object or background in different scenes, depending on its relationship with the foreground.

To solve these problems, the authors propose a new segmentation model, VTrUNet (Virtual band construction and Transformer-boosted UNet). This model contains two main modules:

- **Virtual band construction module (VC)**: Used to capture spectral patterns by expanding the number of channels of the input image to represent different spectral features.
- **Enhanced UNet module (TrUNet)**: A vision transformer combined with a self-attention mechanism is used to capture long-distance contextual features.

### Model structure

The architecture of VTrUNet is shown in Figure 2 and specifically includes the following parts:

1. **Virtual band construction module (VC)**:
   - Takes a 6-channel image (red, green, blue, near-infrared, short-wave infrared 1, short-wave infrared 2) as input and outputs a 64-channel tensor.
   - Uses convolution kernels of different sizes (1x1, 3x3, 5x5) to extract features over different ranges and concatenates them along the channel dimension.
2. **Enhanced UNet module (TrUNet)**:
   - Adds a Transformer block (TrfB) at each UNet level to capture long-distance correlations.
   - The input at each level passes through a convolution block (ConvB) and is then divided into two paths:
     - Upper path: Extracts long-distance features through the Transformer block.
     - Lower path: Passes directly through the residual connection to the right side.
   - On the right side, the output of the Transformer block, the output of the residual path, and the up-sampled output from the next level are concatenated and then pass through a convolution block to generate the output of this level.
3. **Multi-layer perceptron (MLP)**:
   - Used to predict the class of each pixel (smoke, cloud, clear background). The output is an RGB image in which the red channel corresponds to smoke, the green channel to cloud, and the blue channel to clear background.

### Evaluation metrics

To evaluate the model performance, the authors introduce an improved F1 score (F1h), which takes into account the unlabeled areas (gaps) in the partially labeled data. It is defined as follows:

\[ \text{prec}(c_i)=\frac{\text{pn}(\hat{c}_i\cap\tilde{c}_i)}{\text{pn}(\hat{c}_i)} \]

\[ \text{rec}(c_i)=\frac{\text{pn}(\hat{c}_i\cap\tilde{c}_i)}{\text{pn}(\tilde{c}_i)} \]

\[ \text{F1}(c_i) = 2\cdot\frac{\text{prec}(c_i)\cdot\text{rec}(c_i)}{\text{prec}(c_i)+\text{rec}(c_i)} \]

The correction factor \(r_h(c_i)\) is defined as:

\[ r_h(c_i)=\frac{\text{pn}(\hat{c}_{i,h})}{\text{pn}(\hat{c}_i)+\frac{\text{pn}(h)}{N}} \]

The final corrected F1 score is:

\[ \text{F1}_h(c_i)=\text{F1}(c_i)\cdot(1 - r_h(c_i)) \]

### Experimental results

The experiments use multispectral image datasets from Landsat 5 and Landsat 8.
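
The virtual band construction step described under "Model structure" can be illustrated with a minimal NumPy sketch. It only demonstrates the idea of running 1x1, 3x3, and 5x5 convolutions over the 6 input bands and concatenating the results into 64 channels. The 22/21/21 channel split, the random weights, the absence of any activation or normalization, and the function names `conv2d_same` and `virtual_bands` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1, zero-padded 2D convolution (cross-correlation).

    x: array of shape (C_in, H, W)
    w: array of shape (C_out, C_in, k, k) with odd k
    Returns an array of shape (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding on H and W
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    # Accumulate the contribution of each kernel offset (dy, dx).
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def virtual_bands(x, out_channels=(22, 21, 21), kernel_sizes=(1, 3, 5), seed=0):
    """Concatenate multi-scale convolution outputs along the channel axis.

    The channel split per kernel size is hypothetical; it is chosen only so
    that the branches sum to 64 channels. Weights are random placeholders.
    """
    rng = np.random.default_rng(seed)
    branches = []
    for c_out, k in zip(out_channels, kernel_sizes):
        w = rng.normal(scale=0.1, size=(c_out, x.shape[0], k, k))
        branches.append(conv2d_same(x, w))
    return np.concatenate(branches, axis=0)

# Toy 6-band image: R, G, B, NIR, SWIR1, SWIR2 (random values, 16x16 pixels).
image = np.random.default_rng(1).random((6, 16, 16))
bands = virtual_bands(image)
print(bands.shape)  # (64, 16, 16)
```

In a trained model the weights would be learned jointly with the rest of the network; the sketch only shows how the channel count grows from 6 to 64 while the spatial size is preserved.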
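
The F1h definition under "Evaluation metrics" can be worked through on a toy example. The sketch below follows the formulas literally, under several assumptions that the summary does not spell out: pn(·) is read as a pixel count, ĉ_i as the set of pixels predicted as class c_i, c̃_i as the set of pixels labeled c_i, ĉ_{i,h} as the predicted pixels that fall inside the unlabeled region h, and N as the total number of pixels. The function name `f1_h` and the use of `None` to mark unlabeled pixels are hypothetical.

```python
def f1_h(pred, label, cls, unlabeled=None):
    """Return (F1, r_h, F1_h) for one class on flat sequences of class ids.

    pred:  predicted class id per pixel
    label: labeled class id per pixel, with `unlabeled` marking gap pixels
    """
    n = len(pred)
    pred_c = {i for i, p in enumerate(pred) if p == cls}      # predicted as cls
    true_c = {i for i, t in enumerate(label) if t == cls}     # labeled as cls
    gap = {i for i, t in enumerate(label) if t == unlabeled}  # unlabeled region h

    inter = len(pred_c & true_c)
    prec = inter / len(pred_c) if pred_c else 0.0
    rec = inter / len(true_c) if true_c else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

    # Correction factor: share of class predictions that land in the gap.
    denom = len(pred_c) + len(gap) / n
    r_h = len(pred_c & gap) / denom if denom > 0 else 0.0
    return f1, r_h, f1 * (1 - r_h)

# Toy example: 0 = smoke, 1 = cloud, 2 = clear, None = unlabeled pixel.
pred = [0, 0, 0, 1, 1, 2, 2, 2]
label = [0, 0, None, 1, 2, 2, 2, None]
f1, r_h, corrected = f1_h(pred, label, cls=0)
print(f1, r_h, corrected)
```

Here one of the three smoke predictions falls in an unlabeled pixel, so the plain F1 is scaled down by the factor (1 - r_h). With no unlabeled pixels, r_h is zero and F1h equals F1.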

A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagery
