Abstract: The application of 3D ViTs to medical image segmentation has advanced rapidly, somewhat overshadowing recent progress in Convolutional Neural Network (CNN)-based models. Large kernel depthwise convolution has emerged as a promising technique, showing capabilities akin to hierarchical transformers and enabling the large effective receptive field (ERF) that dense prediction requires. Even so, existing core operators, ranging from global-local attention to large kernel convolution, exhibit inherent trade-offs and limitations (e.g., global-local range trade-off, aggregating attentional features). We hypothesize that deformable convolution can be an exploratory alternative that combines the advantages of the previous operators, providing long-range dependency, adaptive spatial aggregation, and computational efficiency as a foundation backbone. In this work, we introduce 3D DeformUX-Net, a volumetric CNN model designed to address the shortcomings traditionally associated with ViTs and large kernel convolution. Specifically, we revisit volumetric deformable convolution in a depth-wise setting to capture long-range dependency with computational efficiency. Inspired by structural re-parameterization of convolution kernel weights, we further generate the deformable tri-planar offsets with a parallel branch (starting from a $1\times1\times1$ convolution), providing adaptive spatial aggregation across all channels. Our empirical evaluations show that 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models across four challenging public datasets, spanning scales from organs (KiTS: 0.680 to 0.720, MSD Pancreas: 0.676 to 0.717, AMOS: 0.871 to 0.902) to vessels (e.g., MSD hepatic vessels: 0.635 to 0.671) in mean Dice.
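
The results above are reported in mean Dice. As a reading aid, the following is a minimal sketch of how a mean Dice score over labeled volumes might be computed. It is not the authors' evaluation code: the function names `dice_score` and `mean_dice`, the convention of scoring two empty masks as 1.0, and the toy volumes are all assumptions made for illustration.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks of the same shape.

    Assumption for this sketch: two empty masks count as a perfect match.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * intersection / denom)

def mean_dice(pred_labels, target_labels, class_ids):
    """Average the per-class Dice over the given foreground class ids."""
    scores = [dice_score(pred_labels == c, target_labels == c) for c in class_ids]
    return float(np.mean(scores))

# Toy 4x4x4 label volumes (hypothetical data): class 1 fills two slices in each,
# and the two regions overlap in exactly one slice.
pred = np.zeros((4, 4, 4), dtype=int)
target = np.zeros((4, 4, 4), dtype=int)
pred[:2] = 1
target[1:3] = 1
score = mean_dice(pred, target, class_ids=[1])
```

In this toy case each mask has 32 voxels and they share 16, so the Dice score is 2 * 16 / (32 + 32) = 0.5.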

DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image Segmentation with Depthwise Deformable Convolution

Adaptive Decomposition and Shared Weight Volumetric Transformer Blocks for Efficient Patch-Free 3D Medical Image Segmentation

DeU-Net: Deformable U-Net for 3D Cardiac MRI Video Segmentation

3D Multiple-Contextual ROI-Attention Network for Efficient and Accurate Volumetric Medical Image Segmentation

A Lightweight Deep Network for 3D Medical Image Segmentation

DeU-Net 2.0: Enhanced deformable U-Net for 3D cardiac cine MRI segmentation

3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation

3D ConvNet+: A lightweight adaptive network for 3D medical image segmentation

Deep Sequential Segmentation of Organs in Volumetric Medical Scans

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

A More Design-Flexible Medical Transformer for Volumetric Image Segmentation

Scaling Up 3D Kernels with Bayesian Frequency Re-parameterization for Medical Image Segmentation

A 3D Convolutional Neural Network for Volumetric Image Semantic Segmentation

3D LVCN: A Lightweight Volumetric ConvNet

3D Tiled Convolution for Effective Segmentation of Volumetric Medical Images

A 3D Coarse-to-Fine Framework for Volumetric Medical Image Segmentation

Deform U-Net: Unsupervised Deformable 3D Biomedical Image Registration Network

TransBTSV2: Wider Instead of Deeper Transformer for Medical Image Segmentation

CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation

D-Net: Dynamic Large Kernel with Dynamic Feature Fusion for Volumetric Medical Image Segmentation