Abstract: The encoder–decoder structure is a universal method for semantic image segmentation. However, some important image information is lost as the depth of a convolutional neural network (CNN) increases, and the correlation between arbitrary pixels weakens. This paper designs a novel image segmentation model to obtain dense feature maps and improve segmentation results. In the encoder stage, we employ ResNet-50 to extract features and then add a spatial pyramid pooling (SPP) module to achieve multi-scale feature fusion. In the decoder stage, we provide an improved position attention module that integrates contextual information effectively and removes trivial information by changing how the attention matrix is constructed. Furthermore, we propose a feature fusion structure that generates dense feature maps by performing an element-wise sum on the upsampled features and the corresponding encoder features. The simulation results show that the average accuracy and mIOU on the CamVid dataset reach 90.7% and 63.1%, respectively, which verifies the effectiveness and reliability of the proposed method.

What problem does this paper attempt to address?

This paper attempts to solve the problems of important information loss and weakened inter-pixel correlation in semantic image segmentation caused by the increasing depth of convolutional neural networks (CNNs). Specifically, as the number of convolutional layers increases, the spatial resolution of the image decreases, and position, boundary, and detail information are lost, resulting in sparse feature maps. To solve these problems, the paper proposes a new image segmentation model (AFF-Net) based on an encoder-decoder structure, which obtains dense feature maps through an improved position-attention module and a feature-fusion structure, thereby enhancing the segmentation effect.

### Main contributions of the paper

1. **Proposing a new encoder-decoder structure (AFF-Net)**: This structure can obtain dense feature maps and improve the segmentation effect.
2. **Designing an improved position-attention module**: By introducing judgment conditions, global features can be integrated more effectively.
3. **Proposing a feature-fusion structure**: This structure can generate dense feature maps.
4. **Experimental results**: On the CamVid dataset, the average accuracy and mIOU of AFF-Net reach 90.7% and 63.1% respectively, outperforming other comparable algorithms.

### Specific improvements of the model

- **Encoder stage**: Use ResNet-50 to extract features and add a spatial pyramid pooling (SPP) module to achieve multi-scale feature fusion. At the same time, save the max-pooling indices in each pooling layer to store position information.
- **Decoder stage**: Gradually restore the resolution and the lost information in five steps. Each step includes an up-sampling operation, a feature-fusion structure, and a convolutional operation. The improved position-attention module is added only before the first step, to effectively integrate context information and remove irrelevant information.
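The feature-fusion step described above (up-sample the decoder features, then add the corresponding encoder features element-wise) can be sketched in NumPy. This is only an illustration under stated assumptions: the function names `upsample_nearest` and `fuse_features` are hypothetical, and nearest-neighbour up-sampling by a factor of 2 is a stand-in, since the summary does not specify how AFF-Net performs up-sampling.

```python
import numpy as np


def upsample_nearest(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbour up-sampling of a (C, H, W) feature map.

    Assumption: a simple stand-in for the decoder's up-sampling operation.
    """
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_features(decoder_feat: np.ndarray, encoder_feat: np.ndarray) -> np.ndarray:
    """Up-sample decoder features and add the matching encoder features element-wise."""
    up = upsample_nearest(decoder_feat)
    if up.shape != encoder_feat.shape:
        raise ValueError(f"shape mismatch: {up.shape} vs {encoder_feat.shape}")
    return up + encoder_feat


# Toy example: one channel, 2x2 decoder map fused with a 4x4 encoder map.
decoder_feat = np.arange(4, dtype=np.float32).reshape(1, 2, 2)
encoder_feat = np.ones((1, 4, 4), dtype=np.float32)
fused = fuse_features(decoder_feat, encoder_feat)
print(fused.shape)
```

The element-wise sum requires the two feature maps to have identical shapes, which is why the sketch checks shapes explicitly before adding.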

### Improved position-attention module

The position-attention module aims to enhance the relationship between any pixels in the feature map. Improvements in the paper include:

- Using single-channel convolution instead of multi-channel convolution to reduce computational cost.
- Adding a convolutional layer to introduce more abundant information.
- Introducing a judgment mechanism to remove some unimportant information through the hyper-parameter \(\beta\). The formula is as follows:

\[
M_{A_{k,t}} =
\begin{cases}
M'_{A_{k,t}}, & \text{if } M'_{A_{k,t}} \geq \beta \\
0, & \text{otherwise}
\end{cases}
\]

where \(M'_{A}\) is the original attention matrix and \(\beta \in [0, 1]\) is a hyper-parameter that controls which attention values are retained.

### Experimental results

The experiment was carried out on the CamVid dataset, and the evaluation metrics were average accuracy and mIOU. The experimental results show that when \(\beta = 0.35\), the model has the best performance, with an average accuracy of 90.7% and an mIOU of 63.1%.

In conclusion, this paper significantly improves the effect of semantic image segmentation through the improved position-attention module and feature-fusion structure, solving the problems of information loss and weakened pixel correlation in traditional methods.
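
The judgment mechanism on the attention matrix can be illustrated with a short NumPy sketch: entries of the original attention matrix that fall below β are set to zero, and the rest are kept unchanged. Several details here are assumptions rather than the paper's design: the helper names `softmax_rows` and `threshold_attention` are hypothetical, and the original attention matrix is built from a row-wise softmax over dot-product similarities of random features, because the summary does not say how that matrix is computed or normalized.

```python
import numpy as np


def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax (assumed normalization of the attention)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def threshold_attention(m_prime: np.ndarray, beta: float) -> np.ndarray:
    """Keep attention values >= beta and zero out the rest."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return np.where(m_prime >= beta, m_prime, 0.0)


# Toy feature map flattened to (C, N): 4 channels, 6 spatial positions.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6))

# Assumed similarity: dot product between every pair of positions, shape (N, N).
scores = feats.T @ feats
m_prime = softmax_rows(scores)

# beta = 0.35 is the value the summary reports as giving the best results.
m_a = threshold_attention(m_prime, beta=0.35)
print(m_a.shape)
```

Note that after thresholding the rows no longer sum to one; whether the original model renormalizes them is not stated in the summary, so the sketch leaves them as they are.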

Semantic Image Segmentation with Improved Position Attention and Feature Fusion
