Abstract: Image classification plays a pivotal role in analyzing biomedical images, serving as a cornerstone for both biological research and clinical diagnostics. We demonstrate that large multimodal models (LMMs), like GPT-4, excel in one-shot learning, generalization, interpretability, and text-driven image classification across diverse biomedical tasks. These tasks include the classification of tissues, cell types, cellular states, and disease status. LMMs stand out from traditional single-modal classification approaches, which often require large training datasets and offer limited interpretability.

What problem does this paper attempt to address?

The paper asks how to improve one-shot learning, generalization, and interpretability in biomedical image classification. Specifically:

1. **One-shot learning ability**: Traditional single-modality deep-learning methods usually require a large amount of training data. In some settings, such as the diagnosis of rare diseases, collecting a large amount of labeled data is difficult and time-consuming. The paper therefore evaluates whether large multimodal models (LMMs) can classify effectively from only a few training samples, or from a single one.
2. **Generalization ability**: Biomedical images produced by different laboratories can differ in conditions, equipment, and experimental procedures, which leads to inconsistencies between the training set and the test set. The paper explores whether LMMs handle these differences better and maintain high classification accuracy.
3. **Interpretability**: Deep-learning models are often regarded as "black boxes" whose decision-making processes are difficult to understand. This is a particular disadvantage in clinical applications, where doctors need transparent and trustworthy diagnostic tools. The paper studies whether LMMs can explain their classification results in natural language, thereby improving the interpretability and trustworthiness of the model.

To address these problems, the authors systematically compare several leading commercial LMMs (such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) with traditional single-modality image classification methods on six different types of biomedical image classification tasks. The experimental results show that LMMs are significantly superior to single-modality methods in one-shot learning, generalization, and interpretability.
### Formula representation

The following is an example of a formula used in the paper.

- **Accuracy calculation formula**:

\[ \text{Accuracy}=\frac{\sum_{i,j}\mathbb{1}(p_{ij} = l_j)}{100} \]

where \(p_{ij}\) is the predicted binary label when the model is trained on the \(i\)-th reference image array and tested on the \(j\)-th test image, \(l_j\) is the true binary label of the \(j\)-th test image, and \(\mathbb{1}(p_{ij} = l_j)\) equals 1 when the prediction matches the true label and 0 otherwise.
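The accuracy formula above can be illustrated with a minimal Python sketch. It counts the (reference array, test image) pairs whose predicted binary label matches the true label and divides by the number of pairs. The function name `accuracy`, the nested-list layout of the predictions, and the toy data are assumptions made for illustration and do not come from the paper. The sketch also divides by the actual pair count instead of the fixed 100 in the formula; the two agree only if the evaluation has exactly 100 pairs, which is an assumption here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Return the fraction of (i, j) pairs with predictions[i][j] == labels[j].

    predictions[i][j] plays the role of p_ij: the binary label predicted for
    test image j when the model is given reference image array i.
    labels[j] plays the role of l_j: the true binary label of test image j.
    """
    if not predictions or not labels:
        raise ValueError("predictions and labels must be non-empty")

    correct = 0
    total = 0
    for row in predictions:
        if len(row) != len(labels):
            raise ValueError("each prediction row needs one entry per test image")
        for predicted, true_label in zip(row, labels):
            correct += int(predicted == true_label)  # indicator 1(p_ij = l_j)
            total += 1

    # Assumption: normalize by the number of pairs (the paper's formula uses 100).
    return correct / total


# Toy data (hypothetical): 2 reference arrays, 4 test images.
toy_predictions = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]
toy_labels = [1, 0, 1, 0]

print(accuracy(toy_predictions, toy_labels))  # 6 of 8 pairs match -> 0.75
```

In the toy data, each reference array yields three correct predictions out of four, so six of the eight pairs match.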

Assessing large multimodal models for one-shot learning and interpretability in biomedical image classification

Multimodal Large Language Models for Bioimage Analysis

Multimodal Large Language Models are Generalist Medical Image Interpreters

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging

Comparison of Multi-Modal Large Language Models with Deep Learning Models for Medical Image Classification

Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Exploring the Capabilities of Large Multimodal Models on Dense Text

Multimodal Foundation Models Exploit Text to Make Medical Image Predictions

Evaluating General Vision-Language Models for Clinical Medicine

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

Multi-modal large language models in radiology: principles, applications, and potential

Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4

Few-shot medical image classification with simple shape and texture text descriptors using vision-language models

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Multimodal LLMs for Retinal Disease Diagnosis via OCT: Few-Shot vs Single-Shot Learning