MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets

Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi
2024-06-07
Abstract: Despite its clinical utility, medical image segmentation (MIS) remains a daunting task due to images' inherent complexity and variability. Vision transformers (ViTs) have recently emerged as a promising solution to improve MIS; however, they require larger training datasets than convolutional neural networks. To overcome this obstacle, data-efficient ViTs were proposed, but they are typically trained using a single source of data, which overlooks the valuable knowledge that could be leveraged from other available datasets. Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity. In this paper, we propose MDViT, the first multi-domain ViT that includes domain adapters to mitigate data-hunger and combat NKT by adaptively exploiting knowledge in multiple small data resources (domains). Further, to enhance representation learning across domains, we integrate a mutual knowledge distillation paradigm that transfers knowledge between a universal network (spanning all the domains) and auxiliary domain-specific branches. Experiments on 4 skin lesion segmentation datasets show that MDViT outperforms state-of-the-art algorithms, with superior segmentation performance and a fixed model size at inference time, even as more domains are added. Our code is available at https://github.com/siyi-wind/MDViT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of training Vision Transformers (ViTs) on small medical image segmentation datasets. Although ViTs have shown promise for medical image segmentation (MIS), they require more training data than convolutional neural networks (CNNs), and that much data is often unavailable in practice. Moreover, naively combining datasets from different domains can cause negative knowledge transfer (NKT): model performance drops on some domains because of the heterogeneity between them. To address both problems, the authors propose MDViT (Multi-domain Vision Transformer), a multi-domain ViT with domain adapters that mitigates data hunger and counters NKT by adaptively exploiting knowledge from multiple small data resources (domains). To strengthen cross-domain representation learning, they also integrate a mutual knowledge distillation paradigm that transfers knowledge between a universal network (spanning all domains) and auxiliary domain-specific branches. The model size at inference time stays fixed even as more domains are added. Experiments on four skin lesion segmentation datasets show that MDViT outperforms existing state-of-the-art algorithms. In particular, on the skin cancer detection dataset, the reported IoU increases by 10.16% compared with separate training (ST).
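For readers unfamiliar with the metric quoted above, IoU (intersection over union) measures the overlap between a predicted mask and a ground-truth mask. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `iou`, the nested-list mask format, and the convention for two empty masks are all assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection over union of two binary masks.

    Both masks are nested lists of 0/1 values with the same shape
    (an assumed format chosen to keep the example dependency-free).
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0   # pixel is foreground in both masks
            union += 1 if (p or t) else 0    # pixel is foreground in at least one
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 2 overlapping pixels / 4 in the union = 0.5
```

A segmentation benchmark would typically average this score over all test images in a dataset; the exact averaging used in the paper is not stated in this summary.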