Abstract: Atrous convolutions are employed to increase the receptive field in semantic segmentation tasks. However, previous semantic segmentation work has rarely employed them in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs) and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines to apply atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a dual-branch CNN architecture that incorporates atrous convolutions in the shallow layers of the model and pretrains nearly the entire encoder on ImageNet to achieve better performance. Our models demonstrate the effectiveness of this approach, achieving a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes and BDD datasets. Specifically, DSNet achieves 40.0% mIOU at an inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU at 81.9 FPS on Cityscapes. Source code and models are available on GitHub: https://github.com/takaniwa/DSNet.
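
To make the receptive-field argument concrete, the following is a minimal pure-Python sketch, not the DSNet implementation. It relies only on the standard definition of an atrous (dilated) convolution: a kernel of size k applied with rate r covers k + (k - 1)(r - 1) input positions. The function names are hypothetical and chosen for illustration, and the example is one-dimensional with stride 1 for brevity.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Number of input positions spanned by a kernel whose taps are `rate` apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous convolution (cross-correlation form, stride 1, no padding)."""
    span = effective_kernel_size(len(kernel), rate)
    out = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `rate` positions after the previous tap.
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


def receptive_field(layers):
    """Receptive field of a stack of stride-1 layers given as (kernel_size, rate) pairs."""
    rf = 1
    for kernel_size, rate in layers:
        # Every stride-1 layer widens the receptive field by (effective size - 1).
        rf += effective_kernel_size(kernel_size, rate) - 1
    return rf
```

Under these assumptions, `receptive_field([(3, 1), (3, 2), (3, 4)])` gives 15, against 7 for three plain 3-tap layers, with the same number of weights in both stacks.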

What problem does this paper attempt to address?

The paper addresses how to use atrous convolutions effectively in semantic segmentation, especially in the shallow layers of the model. Specifically, it makes the following points:

1. **Revisiting the Design of Atrous Convolutions in Modern Convolutional Neural Networks (CNNs)**: The authors observe that although atrous convolutions can increase the receptive field, previous works rarely apply them to the shallow layers of the model. Through a series of experiments, they propose three guidelines for applying atrous convolutions more efficiently.
2. **Introducing the DSNet Architecture**: Based on these guidelines, the authors design a dual-branch CNN architecture named DSNet, which integrates atrous convolutions into the shallow layers of the model and pretrains almost the entire encoder on ImageNet to achieve better performance. A key feature of DSNet is the Multi-Scale Attention Fusion module (MSAF) between the two branches, which helps balance detail and contextual information.
3. **Achieving a New Trade-off Between Speed and Accuracy**: The paper shows that DSNet achieves a new state-of-the-art trade-off between speed and accuracy on the ADE20K, Cityscapes, and BDD datasets, particularly excelling in real-time semantic segmentation and high-precision semantic segmentation. For example, on ADE20K, DSNet reaches 40.0% mean Intersection over Union (mIOU) at an inference speed of 179.2 FPS; on Cityscapes, it achieves 80.4% mIOU at 81.9 FPS.
4. **Exploring the Atrous Disasters Phenomenon**: The paper discusses the negative impact of using excessively large atrous rates during ImageNet pretraining, a phenomenon referred to as "Atrous Disasters". To avoid it, the authors suggest that choosing an appropriate atrous rate is crucial for effective feature representation during the pretraining phase.
5. **Proposing Multi-Scale Fusion Atrous Convolution Blocks (MFACB) and Serial-Parallel Atrous Spatial Pyramid Pooling (SPASPP)**: To enhance the model's perception of information at different scales, the MFACB and SPASPP modules are introduced. The former allows the model to learn semantic information at various scales, while the latter further extracts contextual information and rapidly expands the receptive field.

In summary, the paper aims to improve the efficiency and performance of semantic segmentation models by changing how atrous convolutions are used and by proposing new network architectures, especially for scenarios that require real-time processing and high precision.
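
The accuracy figures quoted above are mean Intersection over Union (mIOU) scores. As a rough illustration of how that metric is computed, here is a minimal pure-Python sketch; it is not the benchmarks' evaluation code. The function name `mean_iou` is hypothetical, labels are assumed to be flat lists of integer class indices, and classes absent from both prediction and ground truth are skipped, which is an assumption of this sketch. Benchmark evaluation typically also handles ignore labels, which this sketch does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # A correct pixel counts once in both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    # Assumption: return 0.0 when no class is present at all.
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.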

DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation

Cascaded Multiscale Structure with Self-Smoothing Atrous Convolution for Semantic Segmentation

Rethinking Atrous Convolution for Semantic Image Segmentation

DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network

Hybrid Dilated Convolution Network Using Attentive Kernels for Real-Time Semantic Segmentation

SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Atrous Convolutional Neural Network (ACNN) for Semantic Image Segmentation with Full-Scale Feature Maps

Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation

AtICNet: Semantic Segmentation with Atrous Spatial Pyramid Pooling in Image Cascade Network

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

EADNet: Efficient Asymmetric Dilated Network for Semantic Segmentation

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes

See more than once: Kernel-sharing atrous convolution for semantic segmentation

DPNet: Dual-Pyramid Semantic Segmentation Network Based on Improved Deeplabv3 Plus

Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes

Aggregation Architecture and All-to-one Network for Real-Time Semantic Segmentation

Dynamic Sampling Network for Semantic Segmentation

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation