Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification

Sirui Li, Li Lin, Yijin Huang, Pujin Cheng, Xiaoying Tang
2024-08-27
Abstract: In medical contexts, the imbalanced data distribution in long-tailed datasets, due to scarce labels for rare diseases, greatly impairs the diagnostic accuracy of deep learning models. Recent multimodal text-image supervised foundation models offer new solutions to data scarcity through effective representation learning. However, their limited medical-specific pretraining hinders their performance in medical image classification relative to natural images. To address this issue, we propose a novel Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). We adopt a two-stage training strategy, integrating representations from the foundation model using just two linear adapters and a single ensembler for balanced outcomes. Experimental results on two long-tailed medical image datasets validate the simplicity, lightweight design, and efficiency of our approach: requiring only 6.1% of the GPU memory used by the current best-performing algorithm, our method achieves an accuracy improvement of up to 27.1%, highlighting the substantial potential of foundation model adaptation in this area.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses class imbalance in long-tailed medical image classification. Because labels for rare diseases are scarce, real-world medical datasets often follow a long-tailed distribution, which biases deep learning models toward common categories and lowers diagnostic accuracy on critical rare conditions. To tackle this challenge, the paper proposes a framework called Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). The method uses a two-stage training strategy that adapts representations from a text-image foundation model with two linear adapters and a single ensembler, drawing on richer associative representations in the text space. On two long-tailed medical image datasets, it achieves an accuracy improvement of up to 27.1% while requiring only 6.1% of the GPU memory used by the current best-performing algorithm. This suggests that foundation model adaptation has substantial potential for long-tailed tasks.
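The bias toward common categories described above can be made concrete with a small sketch. It compares overall accuracy with class-balanced accuracy (the mean of per-class recall) for a classifier that always predicts the most frequent class. The class names, the sample counts, and the choice of metric are illustrative assumptions; they are not taken from the paper or its datasets.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical long-tailed test set: one head class and two tail classes.
y_true = ["common"] * 95 + ["uncommon"] * 4 + ["rare"] * 1

# A degenerate classifier that always predicts the head class.
y_pred = ["common"] * len(y_true)

overall = overall_accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)

print(f"overall accuracy:  {overall:.3f}")   # 0.950
print(f"balanced accuracy: {balanced:.3f}")  # 0.333
```

In this toy setting the classifier looks strong on overall accuracy yet never detects either tail class, which is the failure mode that long-tailed methods aim to correct.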