A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagery

Jixue Liu, Jiuyong Li, Stefan Peters, Liang Zhao
2024-06-19
Abstract: Many studies have been done to detect smoke from satellite imagery. However, these prior methods are still not effective in detecting various types of smoke in complex backgrounds. Smoke presents challenges in detection due to variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by proposing a new segmentation model called VTrUNet, which consists of a virtual band construction module to capture spectral patterns and a transformer-boosted UNet to capture long-range contextual features. The model takes imagery of six bands as input: red, green, blue, near infrared, and two shortwave infrared bands. To show the advantages of the proposed model, the paper presents extensive results for various possible model architectures improving UNet and draws interesting conclusions, including that adding more modules to a model does not always lead to better performance. The paper also compares the proposed model with very recently proposed and related models for smoke segmentation and shows that the proposed model performs the best and makes significant improvements in prediction performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the fact that existing smoke segmentation methods cannot reliably detect all types of smoke in multispectral Landsat satellite images with complex backgrounds. Specifically, smoke detection faces the following challenges:

1. **Changes in density, color, and lighting conditions**: These characteristics make smoke difficult to distinguish from the background.
2. **Background interference**: Background factors such as clouds and haze increase the difficulty of detection.
3. **Contextual nature of thin smoke**: Thin smoke may appear as either an object or part of the background in different scenes, depending on its relationship with the foreground.

To solve these problems, the authors propose a new segmentation model, VTrUNet (Virtual band construction and Transformer-boosted UNet). The model contains two main modules:

- **Virtual band construction module (VC)**: Captures spectral patterns by expanding the number of channels of the input image to represent different spectral features.
- **Enhanced UNet module (TrUNet)**: Uses a vision transformer with a self-attention mechanism to capture long-range contextual features.

### Model structure

The architecture of VTrUNet is shown in Figure 2 and includes the following parts:

1. **Virtual band construction module (VC)**:
   - Takes a 6-channel image (red, green, blue, near-infrared, short-wave infrared 1, short-wave infrared 2) as input and outputs a 64-channel tensor.
   - Uses convolution kernels of different sizes (1x1, 3x3, 5x5) to extract features over different ranges and concatenates them along the channel dimension.
2. **Enhanced UNet module (TrUNet)**:
   - Adds a Transformer block (TrfB) at each UNet level to capture long-range correlations.
   - At each level, the input passes through a convolution block (ConvB) and is then split into two paths:
     - Upper path: Extracts long-range features through the Transformer block.
     - Lower path: Passes directly through the residual connection to the right side.
   - On the right side, the output of the Transformer block, the output of the residual path, and the up-sampled output from the next level are concatenated and then passed through a convolution block to generate the output of this level.
3. **Multi-layer perceptron (MLP)**:
   - Predicts the class of each pixel (smoke, cloud, clear background). The output is an RGB image in which the red channel corresponds to smoke, the green channel to cloud, and the blue channel to clear background.

### Evaluation metrics

To evaluate model performance, the authors introduce an improved F1 score, \(\text{F1}_h\), which takes into account the unlabeled areas (gaps) in the partially labeled data. Precision, recall, and F1 for a class \(c_i\) are defined as:

\[ \text{prec}(c_i)=\frac{\text{pn}(\hat{c}_i\cap\tilde{c}_i)}{\text{pn}(\hat{c}_i)} \]

\[ \text{rec}(c_i)=\frac{\text{pn}(\hat{c}_i\cap\tilde{c}_i)}{\text{pn}(\tilde{c}_i)} \]

\[ \text{F1}(c_i) = 2\cdot\frac{\text{prec}(c_i)\cdot\text{rec}(c_i)}{\text{prec}(c_i)+\text{rec}(c_i)} \]

The correction factor \(r_h(c_i)\) is defined as:

\[ r_h(c_i)=\frac{\text{pn}(\hat{c}_{i,h})}{\text{pn}(\hat{c}_i)+\frac{\text{pn}(h)}{N}} \]

The final corrected F1 score is:

\[ \text{F1}_h(c_i)=\text{F1}(c_i)\cdot(1 - r_h(c_i)) \]

### Experimental results

The experiments use multispectral image datasets from Landsat 5 and Landsat 8.