DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Zilu Guo, Liuyang Bian, Xuan Huang, Hu Wei, Jingyu Li, Huasheng Ni
2024-06-06
Abstract: Atrous convolutions are employed as a method to increase the receptive field in semantic segmentation tasks. However, in previous works on semantic segmentation, they were rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs), and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines to apply atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture, which incorporates atrous convolutions in the shallow layers of the model architecture and pretrains nearly the entire encoder on ImageNet to achieve better performance. Demonstrating the effectiveness of our approach, our models achieve a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes and BDD datasets. Specifically, DSNet achieves 40.0% mIOU with an inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU with a speed of 81.9 FPS on Cityscapes. Source code and models are available on GitHub: https://github.com/takaniwa/DSNet
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to use atrous convolutions effectively in semantic segmentation, especially in the shallow layers of the model. Specifically, it makes the following points:
1. **Revisiting the Design of Atrous Convolutions in Modern Convolutional Neural Networks (CNNs)**: The authors find that although atrous convolutions can increase the receptive field, previous works rarely apply them to the shallow layers of the model. Through a series of experiments, they propose three guiding principles for applying atrous convolutions more efficiently.
2. **Introducing the DSNet Architecture**: Based on these principles, the authors design a dual-branch CNN architecture named DSNet, which integrates atrous convolutions into the shallow layers of the model and pre-trains almost the entire encoder on ImageNet to achieve better performance. A key feature of DSNet is the Multi-Scale Attention Fusion module (MSAF) between the two branches, which helps balance detail and contextual information.
3. **Achieving a New Trade-off Between Speed and Accuracy**: The paper demonstrates that DSNet achieves a new trade-off between speed and accuracy on the ADE20K, Cityscapes, and BDD datasets, in both real-time and high-precision semantic segmentation. For example, on ADE20K, DSNet reaches 40.0% mean Intersection over Union (mIOU) at an inference speed of 179.2 FPS; on Cityscapes, it achieves 80.4% mIOU at 81.9 FPS.
4. **Exploring the Atrous Disasters Phenomenon**: The paper discusses the negative impact of using excessively large atrous rates during ImageNet pre-training, a phenomenon referred to as "Atrous Disasters". To avoid this issue, the authors suggest that choosing an appropriate atrous rate is crucial to ensure effective feature representation during the pre-training phase.
5. **Proposing Multi-Scale Fusion Atrous Convolution Blocks (MFACB) and Serial-Parallel Atrous Spatial Pyramid Pooling (SPASPP)**: To enhance the model's perception of information at different scales, the MFACB and SPASPP modules are introduced. The former allows the model to effectively learn semantic information at various scales, while the latter further extracts contextual information, rapidly expanding the receptive field.
In summary, the paper aims to improve the efficiency and performance of semantic segmentation models by changing how atrous convolutions are used and by proposing new network modules, especially in scenarios requiring real-time processing and high precision.
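The claim that atrous convolutions enlarge the receptive field can be illustrated with a small calculation. The sketch below uses the standard formulas for the effective kernel size of a dilated convolution and for the receptive field of a stack of layers. The function names and the layer settings (three 3x3 layers, atrous rate 2) are hypothetical choices for illustration only and are not taken from DSNet's actual configuration.

```python
from typing import List, Tuple


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel along a single axis.

    A k-tap kernel with atrous rate d leaves (d - 1) gaps between
    neighbouring taps, so it spans k + (k - 1) * (d - 1) input positions.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers: List[Tuple[int, int, int]]) -> int:
    """Receptive field of a sequential stack of convolutions (one axis).

    Each layer is a (kernel_size, dilation, stride) tuple.
    """
    rf = 1    # a single output position sees one input position to start with
    jump = 1  # distance in input pixels between adjacent positions at this depth
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Illustrative stacks (assumed values, not DSNet's real layers):
plain_stack = [(3, 1, 1)] * 3   # three ordinary 3x3 convolutions
atrous_stack = [(3, 2, 1)] * 3  # the same stack with atrous rate 2

print(receptive_field(plain_stack))   # 7
print(receptive_field(atrous_stack))  # 13
```

With the same number of weights per layer, the dilated stack nearly doubles the receptive field, which is the effect the summary attributes to atrous convolutions. The calculation says nothing about accuracy; the paper's "Atrous Disasters" observation is a reminder that larger rates are not automatically better.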