Multimodal Large Language Models are Generalist Medical Image Interpreters

Tianyu Han, Lisa C. Adams, Sven Nebelung, Jakob Nikolas Kather, Keno K. Bressem, Daniel Truhn
DOI: https://doi.org/10.1101/2023.12.21.23300146
2023-12-23
medRxiv
Abstract: Advanced multimodal large language models (LLMs), such as GPT-4V(ision) and Gemini Ultra, have shown promising results in the diagnosis of complex pathological conditions. This raises questions about their knowledge base: Do these models deeply understand medical cases, including images, or do they simply recognize superficial patterns from extensive pre-training? We aimed to determine whether LLMs can develop usable internal representations of images, and whether these representations improve the classification of medical images. We rigorously tested the performance of the open-source Flamingo-80B model, which is not specifically tailored for medical tasks, against traditional pre-training methods. The tests covered eight distinct image classification tasks in pathology, dermatology, ophthalmology, and radiology, using CLIP and the 9B and 80B Flamingo multimodal models. These tasks ranged from tissue and nuclear classification in histopathology to lesion detection in dermatology and disease grading in radiology. We systematically evaluated the models' internal image representations to determine their relevance and usefulness in medical diagnosis. Our analysis showed that the internal image representations of the largest model, Flamingo-80B, classified medical images more accurately than all other methods. These results held even when the number of samples available for training was small. Our results show that multimodal LLMs acquire structured knowledge in medical domains. This suggests that these models are evolving from mere pattern recognition tools into entities with broader medical generalist capabilities. This evolution underscores the potential for these models to contribute to medical diagnosis and research, although it remains important to keep evaluating their capabilities and limitations in real-world medical settings.
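
The abstract does not spell out how the internal image representations were evaluated. One common way to test frozen representations, assumed here purely for illustration, is a linear probe: the encoder stays fixed and only a small linear classifier is trained on its output features, which also makes it easy to vary the number of training samples. The sketch below uses synthetic Gaussian clusters as a stand-in for encoder embeddings; the function names, dimensions, and hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.01, epochs=300, l2=1e-3):
    """Fit a softmax classifier on frozen features; the encoder is never updated."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(((features @ W + b).argmax(axis=1) == labels).mean())


def sample_embeddings(rng, class_means, n_per_class):
    """Assumption: synthetic stand-in for frozen encoder outputs (unit-variance clusters)."""
    num_classes, dim = class_means.shape
    labels = np.repeat(np.arange(num_classes), n_per_class)
    feats = class_means[labels] + rng.normal(size=(labels.size, dim))
    return feats, labels


def few_shot_accuracy(n_per_class, num_classes=4, dim=32, seed=0):
    """Train the probe on n_per_class samples per class, evaluate on held-out samples."""
    rng = np.random.default_rng(seed)
    class_means = 3.0 * rng.normal(size=(num_classes, dim))
    train_x, train_y = sample_embeddings(rng, class_means, n_per_class)
    test_x, test_y = sample_embeddings(rng, class_means, 100)
    W, b = train_linear_probe(train_x, train_y, num_classes)
    return probe_accuracy(W, b, test_x, test_y)


if __name__ == "__main__":
    for shots in (2, 8, 32):
        print(shots, few_shot_accuracy(shots))
```

In a real comparison, `sample_embeddings` would be replaced by features extracted once from each frozen encoder, and the same probe would be trained per encoder so that differences in accuracy reflect the representations rather than the classifier.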