Semantic Image Segmentation with Improved Position Attention and Feature Fusion

Hegui Zhu, Yan Miao, Xiangde Zhang
DOI: https://doi.org/10.1007/s11063-020-10240-9
IF: 2.565
2020-05-12
Neural Processing Letters
Abstract: The encoder–decoder structure is a universal method for semantic image segmentation. However, some important image information is lost as the depth of the convolutional neural network (CNN) increases, and the correlation between arbitrary pixels weakens. This paper designs a novel image segmentation model to obtain dense feature maps and improve segmentation results. In the encoder stage, we employ ResNet-50 to extract features, and then add a spatial pyramid pooling (SPP) module to achieve multi-scale feature fusion. In the decoder stage, we provide an improved position attention module that integrates contextual information effectively and removes trivial information by changing how the attention matrix is constructed. Furthermore, we propose a feature fusion structure that generates dense feature maps by performing an element-wise sum on the upsampled features and the corresponding encoder features. The simulation results show that the average accuracy and mIOU on the CamVid dataset reach 90.7% and 63.1%, respectively, verifying the effectiveness and reliability of the proposed method.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper addresses two problems that arise in semantic image segmentation as convolutional neural networks (CNNs) grow deeper: the loss of important information and the weakening of inter-pixel correlation. Specifically, as the number of convolutional layers increases, the spatial resolution decreases, and position, boundary, and detail information is lost, resulting in sparse feature maps. To address this, the authors propose a new image segmentation model (AFF-Net) based on an encoder-decoder structure. It obtains dense feature maps through an improved position-attention module and a feature-fusion structure, thereby improving segmentation.

### Main contributions of the paper

1. **Proposing a new encoder-decoder structure (AFF-Net)**: This structure obtains dense feature maps and improves segmentation.
2. **Designing an improved position-attention module**: By introducing a judgment condition, global features are integrated more effectively.
3. **Proposing a feature-fusion structure**: This structure generates dense feature maps.
4. **Experimental results**: On the CamVid dataset, the average accuracy and mIOU of AFF-Net reach 90.7% and 63.1% respectively, outperforming other comparable algorithms.

### Specific improvements of the model

- **Encoder stage**: ResNet-50 extracts features, and a spatial pyramid pooling (SPP) module is added to achieve multi-scale feature fusion. The max-pooling indices of each pooling layer are also saved to store position information.
- **Decoder stage**: The resolution and the lost information are restored gradually in five steps. Each step consists of an up-sampling operation, a feature-fusion structure, and a convolutional operation. The improved position-attention module is added only before the first step, where it integrates context information and removes irrelevant information.

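
One decoder step's feature fusion, the element-wise sum of up-sampled decoder features and the corresponding encoder features, can be illustrated with a minimal pure-Python sketch. The function names `upsample_nearest` and `fuse_by_sum` are hypothetical, the maps are single-channel toy data, and nearest-neighbour up-sampling is an assumption used only as a stand-in, since the summary does not specify the paper's up-sampling operator.

```python
def upsample_nearest(feature_map, factor=2):
    """Nearest-neighbour up-sampling of a 2-D map given as a list of rows.

    Assumption: a stand-in for the decoder's up-sampling operation.
    """
    upsampled = []
    for row in feature_map:
        # Repeat every value `factor` times along the width.
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times along the height (as copies).
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def fuse_by_sum(decoder_map, encoder_map):
    """Element-wise sum of two 2-D maps that must share the same shape."""
    same_height = len(decoder_map) == len(encoder_map)
    same_width = all(
        len(d_row) == len(e_row) for d_row, e_row in zip(decoder_map, encoder_map)
    )
    if not (same_height and same_width):
        raise ValueError("feature maps must have the same shape")
    return [
        [d + e for d, e in zip(d_row, e_row)]
        for d_row, e_row in zip(decoder_map, encoder_map)
    ]


# Toy data: a 2x2 decoder map is up-sampled to 4x4 and fused with a 4x4 encoder map.
decoder_features = [[1.0, 2.0],
                    [3.0, 4.0]]
encoder_features = [[0.5] * 4 for _ in range(4)]

fused = fuse_by_sum(upsample_nearest(decoder_features), encoder_features)
for row in fused:
    print(row)
```

In this sketch the fused map keeps the encoder's resolution, so every up-sampled decoder value is combined with a higher-resolution encoder value at the same position.
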
### Improved position-attention module

The position-attention module aims to strengthen the relationship between any two pixels in the feature map. The improvements in the paper include:

- Using single-channel convolution instead of multi-channel convolution to reduce computational cost.
- Adding a convolutional layer to introduce richer information.
- Introducing a judgment mechanism that removes unimportant information through the hyper-parameter \(\beta\).

The judgment mechanism is defined as:

\[ M_{A_{k,t}} = \begin{cases} M'_{A_{k,t}}, & \text{if } M'_{A_{k,t}} \geq \beta \\ 0, & \text{otherwise} \end{cases} \]

where \(M'_{A}\) is the original attention matrix and \(\beta \in [0, 1]\) is a hyper-parameter that sets the minimum attention value to retain.

### Experimental results

The experiments were carried out on the CamVid dataset, with average accuracy and mIOU as the evaluation metrics. The model performs best when \(\beta = 0.35\), reaching an average accuracy of 90.7% and an mIOU of 63.1%.

In conclusion, the paper improves semantic image segmentation through the improved position-attention module and the feature-fusion structure, addressing the information loss and weakened pixel correlation that occur as the network deepens.
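
To illustrate the judgment mechanism of the improved position-attention module, the following minimal pure-Python sketch applies the threshold \(\beta\) to an attention matrix. The function name `threshold_attention` and the toy matrix values are hypothetical. How the original attention matrix \(M'_{A}\) is computed is not shown, so the sketch assumes its entries already lie in \([0, 1]\).

```python
def threshold_attention(attention, beta):
    """Keep attention values >= beta and set all others to 0.

    `attention` is a list of rows; entry [k][t] plays the role of M'_A[k, t].
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [
        [weight if weight >= beta else 0.0 for weight in row]
        for row in attention
    ]


# Toy 3x3 attention matrix (illustrative values, not taken from the paper).
original_attention = [[0.60, 0.30, 0.10],
                      [0.35, 0.40, 0.25],
                      [0.05, 0.15, 0.80]]

# beta = 0.35 is the value the summary reports as giving the best results.
filtered = threshold_attention(original_attention, beta=0.35)
for row in filtered:
    print(row)
```

Entries equal to \(\beta\) are retained because the condition is \(\geq\), so the value 0.35 in the second row survives while 0.30 and 0.25 are zeroed.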