Encoder-Decoder based CNN and Fully Connected CRFs for Remote Sensed Image Segmentation

Vikas Agaradahalli Gurumurthy
DOI: https://doi.org/10.48550/arXiv.1910.06041
2019-10-14
Abstract: With the advancement of remote-sensed imaging, large volumes of very high resolution land cover images can now be obtained. Automation of object recognition in these 2D images, however, is still a key issue. High intra-class variance and low inter-class variance in Very High Resolution (VHR) images hamper the accuracy of prediction in object recognition tasks. Most of the recent successful techniques in computer vision tasks are based on deep supervised learning. In this work, a deep Convolutional Neural Network (CNN) based on a symmetric encoder-decoder architecture with skip connections is employed for the 2D semantic segmentation of the most common land cover object classes: impervious surfaces, buildings, low vegetation, trees and cars. Atrous convolutions are employed to give the proposed CNN model a large receptive field. Further, the CNN outputs are post-processed using a Fully Connected Conditional Random Field (FCRF) model to refine the CNN pixel label predictions. The proposed CNN-FCRF model achieves an overall accuracy of 90.5% on the ISPRS Vaihingen Dataset.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses the automation of object recognition in high-resolution remote-sensing images. Specifically, the author focuses on 2D semantic segmentation of very-high-resolution (VHR) images. Such images exhibit large intra-class variance and small inter-class variance, which makes accurate prediction difficult.

### Problem Background

With the progress of remote-sensing imaging technology, large volumes of very-high-resolution land cover images can now be obtained. However, the automatic identification of objects in these 2D images remains a key issue. Especially in urban scenes, different kinds of objects can have similar visual and spectral characteristics, while objects of the same kind may vary greatly, which poses a challenge to segmentation algorithms.

### Solution

To address these problems, the author proposes a deep convolutional neural network (CNN) based on a symmetric encoder-decoder architecture and uses atrous convolutions in the model to expand the receptive field. In addition, to further refine the pixel-label predictions, the author applies a Fully Connected Conditional Random Field (FCRF) model as a post-processing step on the output of the CNN.

### Main Contributions

1. **Model Architecture**: A CNN model based on a symmetric encoder-decoder architecture is proposed, which includes skip connections and expands the receptive field through atrous convolutions.
2. **Receptive Field Expansion**: Atrous convolutions are used instead of adding convolutional layers or enlarging the filters, so that the model captures more context while the computational cost stays unchanged.
3. **Post-processing Optimization**: The FCRF model is used to post-process the output of the CNN, smoothing noisy predictions and improving segmentation boundaries.
4. **Experimental Verification**: Experiments were carried out on the ISPRS Vaihingen dataset, and the results show that the method achieves an overall accuracy of 90.5%.

Through these improvements, the paper aims to improve the accuracy of semantic segmentation in remote-sensing images, especially for images with high intra-class variance and low inter-class variance.
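The receptive-field contribution listed above can be illustrated with a small sketch. It is not the paper's model: the kernel size of 3 and the dilation rates `[1, 2, 4]` are assumptions chosen for illustration, and `receptive_field` and `atrous_conv1d` are hypothetical helper names. The sketch shows that spacing out the kernel taps widens the context seen by each output while the number of weights per layer stays the same.

```python
import numpy as np


def receptive_field(kernel_size, dilations):
    """Receptive field along one axis of a stack of stride-1 convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d input positions.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def atrous_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D atrous (dilated) cross-correlation.

    Kernel taps are placed `dilation` samples apart, so the kernel covers
    (len(kernel) - 1) * dilation + 1 input samples without extra weights.
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * dilation + 1
    out_len = len(signal) - span + 1
    if out_len <= 0:
        raise ValueError("signal is shorter than the dilated kernel span")
    out = np.zeros(out_len)
    for k, weight in enumerate(kernel):
        start = k * dilation
        out += weight * signal[start:start + out_len]
    return out


# Three 3-tap layers: same weight count, different context size.
# Dilation rates here are illustrative assumptions, not the paper's settings.
plain = receptive_field(3, [1, 1, 1])
dilated = receptive_field(3, [1, 2, 4])
print(plain, dilated)
```

With these assumed settings, three ordinary 3-tap layers see 7 input positions, while the dilated stack sees 15 with the same nine weights. Getting the same reach from undilated layers would take more layers or larger filters, which is the trade-off the summary describes.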