Assessing large multimodal models for one-shot learning and interpretability in biomedical image classification

Wenpin Hou, Yilong Qu, Zhicheng Ji
DOI: https://doi.org/10.1101/2023.12.31.573796
2024-10-08
Abstract: Image classification plays a pivotal role in analyzing biomedical images, serving as a cornerstone for both biological research and clinical diagnostics. We demonstrate that large multimodal models (LMMs), like GPT-4, excel in one-shot learning, generalization, interpretability, and text-driven image classification across diverse biomedical tasks. These tasks include the classification of tissues, cell types, cellular states, and disease status. LMMs stand out from traditional single-modal classification approaches, which often require large training datasets and offer limited interpretability.
Bioinformatics
What problem does this paper attempt to address?
This paper asks whether large multimodal models (LMMs) can improve one-shot learning, generalization, and interpretability in biomedical image classification. Specifically:

1. **One-shot learning ability**: Traditional single-modal deep-learning methods usually require large amounts of training data. In some settings, such as the diagnosis of rare diseases, collecting large labeled datasets is difficult and time-consuming. The paper therefore evaluates whether LMMs can classify effectively when given only a few training samples, or a single one.
2. **Generalization ability**: Biomedical images generated by different laboratories can differ in conditions, equipment, and experimental procedures, which leads to inconsistencies between the training set and the test set. The paper explores whether LMMs handle these differences better and maintain high classification accuracy.
3. **Interpretability**: Deep-learning models are often regarded as "black boxes" whose decision-making processes are difficult to understand. This is especially problematic in clinical applications, where doctors need transparent and trustworthy diagnostic tools. The paper studies whether LMMs can explain their classification results in natural language, thereby improving the interpretability and trustworthiness of the model.

To address these questions, the authors systematically compare several leading commercial LMMs (such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) with traditional single-modal image classification methods on six different types of biomedical image classification tasks. The experimental results show that LMMs significantly outperform single-modal methods in one-shot learning, generalization, and interpretability.
### Formula representation

The following is an example of a formula used in the paper:

- **Accuracy calculation formula**:

\[ \text{Accuracy} = \frac{\sum_{i,j} \mathbb{1}(p_{ij} = l_j)}{100} \]

where \(p_{ij}\) is the predicted binary label when the model is trained on the \(i\)-th reference image array and tested on the \(j\)-th test image, \(l_j\) is the true binary label of the \(j\)-th test image, and \(\mathbb{1}(\cdot)\) equals 1 when its condition holds and 0 otherwise.
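The accuracy formula above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `accuracy`, the nested-list layout of the predictions, and the `denominator` parameter are assumptions made for the example. The default of 100 simply mirrors the constant in the formula; the summary does not say how that constant is derived.

```python
from typing import Sequence


def accuracy(
    predictions: Sequence[Sequence[int]],
    labels: Sequence[int],
    denominator: float = 100.0,  # constant from the formula; its origin is an assumption
) -> float:
    """Count matches between p_ij and l_j, then divide by the denominator.

    predictions[i][j] plays the role of p_ij: the binary label predicted for
    test image j when the model was given reference image array i.
    labels[j] plays the role of l_j: the true binary label of test image j.
    """
    correct = 0
    for i, row in enumerate(predictions):
        if len(row) != len(labels):
            raise ValueError(
                f"row {i} has {len(row)} predictions but there are {len(labels)} labels"
            )
        # Indicator 1(p_ij = l_j), summed over all test images j for this i.
        correct += sum(1 for p, l in zip(row, labels) if p == l)
    return correct / denominator


if __name__ == "__main__":
    # Hypothetical data: 10 reference arrays x 10 test images = 100 pairs.
    true_labels = [0, 1] * 5
    preds = [list(true_labels) for _ in range(9)]
    preds.append([1 - l for l in true_labels])  # one reference array gets every image wrong
    print(accuracy(preds, true_labels))
```

With the hypothetical data in the usage block, 90 of the 100 (reference array, test image) pairs match, so the result reads as a fraction of correct predictions. If the number of pairs differs from 100, pass a matching `denominator` to keep that interpretation.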