A Framework for Evaluating the Efficacy of Foundation Embedding Models in Healthcare

Sonnet Xu, Haiwen Gui, Veronica Rotemberg, Tongzhou Wang, Yiqun T. Chen, Roxana Daneshjou
DOI: https://doi.org/10.1101/2024.04.17.24305983
2024-04-19
Abstract: Recent interest has surged in building large-scale foundation models for medical applications. In this paper, we propose a general framework for evaluating the efficacy of these foundation models in medicine, suggesting that they should be assessed across three dimensions: general performance, bias/fairness, and the influence of confounders. Utilizing Google’s recently released dermatology embedding model and lesion diagnostics as examples, we demonstrate that: 1) dermatology foundation models surpass state-of-the-art classification accuracy; 2) general-purpose CLIP models encode features informative for medical applications and should be more broadly considered as a baseline; 3) skin tone is a key differentiator for performance, and the potential bias associated with it needs to be quantified, monitored, and communicated; and 4) image quality significantly impacts model performance, necessitating that evaluation results across different datasets control for this variable. Our findings provide a nuanced view of the utility and limitations of large-scale foundation models for medical AI.
Health Informatics
What problem does this paper attempt to address?
This paper proposes a framework for evaluating foundation embedding models in medicine along three dimensions: general performance, bias/fairness, and the influence of confounders. Using Google's recently released dermatology embedding model and skin lesion diagnosis as a case study, it reports four findings: dermatology foundation models surpass state-of-the-art classification accuracy; general-purpose CLIP models encode features informative for medical tasks and deserve wider use as a baseline; performance differs by skin tone; and image quality significantly affects model performance. The paper therefore argues that potential bias associated with skin tone should be quantified, monitored, and communicated, and that comparisons across datasets should control for image quality. Together, the results give a nuanced view of both the utility and the limitations of large-scale foundation models for medical AI.
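The three evaluation dimensions can be illustrated with a minimal sketch that reports overall accuracy alongside accuracy broken down by a subgroup or confounder label. This is not the authors' code: the function names, the group labels ("lighter", "darker", "good", "poor"), and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def stratified_accuracy(y_true, y_pred, groups):
    """Return (overall accuracy, {group label: accuracy}).

    `groups` holds one label per sample, e.g. a skin-tone category
    (bias/fairness) or an image-quality rating (confounder).
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have equal length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_group


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Toy, made-up labels and predictions (NOT results from the paper).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
skin_tone = ["lighter"] * 4 + ["darker"] * 4          # hypothetical groups
image_quality = ["good", "poor"] * 4                  # hypothetical ratings

overall, by_tone = stratified_accuracy(y_true, y_pred, skin_tone)
_, by_quality = stratified_accuracy(y_true, y_pred, image_quality)

print("overall accuracy:", overall)
print("by skin tone:", by_tone, "gap:", accuracy_gap(by_tone))
print("by image quality:", by_quality, "gap:", accuracy_gap(by_quality))
```

Reporting the per-group numbers and the gap next to the headline accuracy is one simple way to make subgroup differences visible. A fuller analysis would also need to compare datasets at matched image quality rather than only stratifying within one dataset.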