ACC-UNet: A Completely Convolutional UNet model for the 2020s

Nabil Ibtehaz, Daisuke Kihara
2023-08-26
Abstract: This decade is marked by the introduction of the Vision Transformer, a radical paradigm shift in broad computer vision. Medical imaging has followed a similar trend: UNet, one of the most influential architectures, has been redesigned with transformers. Recently, the efficacy of convolutional models in vision has been reinvestigated by seminal works such as ConvNeXt, which elevates a ResNet to Swin Transformer level. Deriving inspiration from this, we aim to improve a purely convolutional UNet model so that it can be on par with transformer-based models, e.g., Swin-Unet or UCTransNet. We examined several advantages of the transformer-based UNet models, primarily long-range dependencies and cross-level skip connections. We attempted to emulate them through convolution operations and thus propose ACC-UNet, a completely convolutional UNet model that brings the best of both worlds: the inherent inductive biases of convnets combined with the design decisions of transformers. ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and consistently outperformed convnets, transformers, and their hybrids. Notably, ACC-UNet outperforms the state-of-the-art models Swin-Unet and UCTransNet by $2.64 \pm 2.54\%$ and $0.45 \pm 1.61\%$ in terms of dice score, respectively, while using a fraction of their parameters ($59.26\%$ and $24.24\%$). Our code is available at https://github.com/kiharalab/ACC-UNet.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks whether a UNet built solely from convolution operations, with no transformer components, can match transformer-based models such as Swin-Unet or UCTransNet. The authors are inspired by recent research showing that purely convolutional models remain competitive in vision tasks and can, in some cases, reach transformer-level performance. They therefore design a new fully convolutional UNet model, ACC-UNet, which aims to combine the inherent inductive biases of convolutional neural networks (CNNs) with the design principles of transformers. Specifically, ACC-UNet pursues this goal in two ways:
1. **Long-Range Dependencies**: The Hierarchical Aggregation of Neighborhood Context (HANC) module mimics the self-attention mechanism in transformers.
2. **Multi-Level Feature Compilation**: The Multi-Level Feature Compilation (MLFC) module fuses feature maps from different encoder levels to enhance the feature representation of each individual level.
Experimental results show that ACC-UNet performs well on 5 different medical image segmentation benchmark datasets, surpassing traditional convolutional models, transformer models, and their hybrids while using substantially fewer parameters than Swin-Unet and UCTransNet. This indicates that applying some design principles of transformers to a purely convolutional UNet model can indeed improve its performance.
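To make the long-range-dependency idea more concrete, here is a minimal sketch of one way neighborhood context could be aggregated hierarchically without attention. The summary above only states that HANC mimics self-attention, so everything in this sketch is an assumption: it pools each feature map over progressively larger patches (2×2, 4×4, ...) using both mean and max, copies each pooled value back over its patch, and stacks the results as extra channels. The function names `pool_and_upsample` and `neighborhood_context` are hypothetical, and the sketch uses plain Python lists and no learned layers.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pool_and_upsample(fmap, patch, reduce_fn):
    """Reduce each patch x patch block with reduce_fn, then copy the
    reduced value back over the whole block (same size as the input)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            value = reduce_fn([fmap[r][c] for r in rows for c in cols])
            for r in rows:
                for c in cols:
                    out[r][c] = value
    return out


def neighborhood_context(fmap, levels=3):
    """Hypothetical helper: return the original map plus mean- and
    max-pooled context at patch sizes 2, 4, ..., 2 ** (levels - 1)."""
    channels = [fmap]
    for k in range(1, levels):
        patch = 2 ** k
        channels.append(pool_and_upsample(fmap, patch, mean))
        channels.append(pool_and_upsample(fmap, patch, max))
    return channels


# A 4x4 single-channel feature map used purely for illustration.
fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
channels = neighborhood_context(fmap, levels=3)
print(len(channels))   # 5: original + (mean, max) at patch sizes 2 and 4
print(channels[1][0])  # [3.5, 3.5, 5.5, 5.5]
```

With larger patches, every position sees a summary of a wider region, which is the sense in which such pooling could stand in for long-range context. In an actual network the stacked channels would presumably be mixed by a learned layer, which this sketch leaves out.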