PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification

Zhenwei Wang, Qiule Sun, Bingbing Zhang, Pengfei Wang, Jianxin Zhang, Qiang Zhang
2024-05-25
Abstract: Few-shot learning has been successfully applied to medical image classification, as only very few medical examples are available for training. Because the number of annotated medical images is limited, image representations should not be derived solely from a single image modality, which is insufficient for characterizing concept classes. In this paper, we propose a new prompting multi-modal model paradigm for medical image classification based on multi-modal foundation models, called PM2. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding image or concept classes and to facilitate few-shot learning across diverse modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under the new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, which completely ignores the merits of the rich statistics inherent in high-level visual tokens. Thus, we instead perform linear classification on the feature distribution of the visual tokens and on the class token simultaneously. To effectively mine such rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads. One is shared between the class token of the image from the vision encoder and the prompt representation encoded by the text encoder. The other performs classification on the feature distribution of the visual tokens from the vision encoder. Extensive experiments on three medical datasets show that PM2 significantly outperforms its counterparts regardless of prompt scheme and achieves state-of-the-art performance.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses the difficulty of training medical image classifiers when annotated data is scarce. Specifically, it studies how multimodal information can improve classification when only a small number of annotated samples are available (the few-shot learning setting). Traditional methods typically rely on large amounts of annotated data, which is impractical in the medical field because high-quality annotated medical images are expensive and time-consuming to obtain. The paper therefore proposes a new prompting multi-modal model paradigm (PM2) that uses text prompts as supplementary information, combined with the image modality, to strengthen few-shot learning.

### Main Contributions

1. **Proposing a new prompting multi-modal model paradigm (PM2)**:
   - This paradigm introduces text prompts as supplementary training samples or modalities for the first time, used to describe images or concept classes.
2. **In-depth study of five text prompt schemes**:
   - Experiments evaluate the impact of different text prompt schemes on the few-shot medical image classification task. The schemes are class names, simple prompts, manually designed prompts, descriptions generated by GPT, and learnable prompts (CoOp).
3. **Introducing a new visual classification head**:
   - Built on the vision encoder, the new head considers not only the first-order statistics of image features (the class token) but also their second-order statistics (the covariance of the visual tokens) to produce stronger image representations.
4. **Extensive experimental validation**:
   - Detailed ablation studies were conducted on three medical image datasets. The experimental results show that PM2 significantly outperforms other methods in few-shot learning scenarios and achieves state-of-the-art performance.

### Method Overview

1. **Review of CLIP**:
   - The paper builds on the pre-trained multimodal model CLIP, which consists of a text encoder and a vision encoder. CLIP learns powerful image and text representations from image-text pairs through contrastive learning.
2. **Overall structure of PM2**:
   - PM2 uses CLIP as its foundation and contains two encoders: an image encoder and a text encoder. The inputs are medical images and their corresponding category descriptions. The image encoder extracts visual features, and the text encoder extracts text features. The classification head combines the first-order and second-order statistics of the visual features for prediction.
3. **Text prompts**:
   - The paper explores five text prompt schemes: class names, simple prompts, manually designed prompts, descriptions generated by GPT, and learnable prompts (CoOp). These schemes aim to supply rich textual information that helps the model better understand the image content.
4. **Visual classification head**:
   - The new visual classification head considers not only the class token (the global image representation) but also the covariance of the visual tokens (second-order statistics) to produce stronger image representations. It applies a linear classifier to the feature distribution of the visual tokens, thereby improving classification performance.

### Conclusion

By introducing multimodal information and a new visual classification head, PM2 achieves significant performance improvements in few-shot medical image classification tasks. The method not only addresses the problem of scarce annotated data but also provides new ideas for future few-shot learning research.
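The second-order classification head described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a power of 1/2 for the matrix power normalization, approximated with a coupled Newton-Schulz iteration, and it assumes that the two heads are combined by summing their logits. Neither detail is stated in the text above. All function names, tensor sizes, and random weights are hypothetical.

```python
import numpy as np


def covariance_pool(tokens: np.ndarray) -> np.ndarray:
    """Aggregate N visual tokens of width d into a (d, d) covariance matrix."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    return centered.T @ centered / tokens.shape[0]


def matrix_sqrt_newton_schulz(cov: np.ndarray, num_iters: int = 5) -> np.ndarray:
    """Approximate cov^(1/2) with the coupled Newton-Schulz iteration.

    Assumption: power 1/2; the source only says "matrix power normalization".
    """
    identity = np.eye(cov.shape[0])
    trace = np.trace(cov)
    y = cov / trace  # trace pre-normalization so the iteration converges
    z = identity.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t
        z = t @ z
    return y * np.sqrt(trace)  # undo the trace pre-normalization


def second_order_feature(tokens: np.ndarray, num_iters: int = 5) -> np.ndarray:
    """Flatten the upper triangle of the normalized covariance into a vector."""
    root = matrix_sqrt_newton_schulz(covariance_pool(tokens), num_iters)
    rows, cols = np.triu_indices(root.shape[0])
    return root[rows, cols]


def combined_logits(class_token, tokens, w_shared, w_cov):
    """Sum the logits of the two linear heads (the sum is an assumption)."""
    first_order = w_shared @ class_token  # head shared with prompt embeddings
    second_order = w_cov @ second_order_feature(tokens)  # covariance head
    return first_order + second_order


# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
num_tokens, dim, num_classes = 50, 4, 3
tokens = rng.normal(size=(num_tokens, dim))
class_token = rng.normal(size=dim)
w_shared = rng.normal(size=(num_classes, dim))
w_cov = rng.normal(size=(num_classes, dim * (dim + 1) // 2))

logits = combined_logits(class_token, tokens, w_shared, w_cov)
print(logits.shape)  # (3,)
```

In training, the same `w_shared` weights would also score the prompt representation from the text encoder, which is what makes the first head shared across modalities. Only the upper triangle of the normalized covariance is kept because the matrix is symmetric, so the remaining entries are redundant.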