Abstract: Text embeddings convert textual information into numerical representations, enabling machines to perform semantic tasks such as information retrieval. Despite their potential, text embeddings remain underexplored in healthcare, in part because few benchmarking studies use biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected 30 embedding models of various parameter sizes and architectures from the Massive Text Embedding Benchmark (MTEB) Hugging Face resource. Models were tested on real-world medical semantic retrieval tasks using (1) PubMed abstracts, (2) synthetic electronic health records (EHRs) generated by the Llama-3-70b model, (3) real-world patient data from the Mount Sinai Health System, and (4) the MIMIC-IV database. Tasks were split into Short Tasks, involving brief text pair interactions such as triage notes and chief complaints, and Long Tasks, which required processing extended documentation such as progress notes and history & physical notes. We assessed models by correlating their performance with data integrity levels, ranging from 0% (fully mismatched pairs) to 100% (perfectly matched pairs), using Spearman correlation. Additionally, we examined correlations between the average Spearman scores across tasks and two MTEB leaderboard benchmarks: the overall recorded average and the average Semantic Textual Similarity (STS) score. We evaluated the 30 embedding models on seven clinical tasks (each involving 2,000 text pairs) at five levels of data integrity, totaling 2.1 million comparisons. Some models performed consistently well, while models based on Mistral-7b excelled in long-context tasks. NV-Embed-v1, despite being the top performer in short tasks, did not perform as well in long tasks.
Our average task performance score (ATPS) correlated more strongly with the MTEB STS score (0.73) than with the MTEB average score (0.67). The suggested framework is flexible, scalable, and resistant to the risk of models overfitting on published benchmarks. Adopting this method can improve embedding technologies in healthcare.
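
The evaluation idea described in the abstract can be illustrated with a minimal sketch. It is one plausible reading of the method, not the authors' implementation. The following are assumptions made for illustration: the five integrity levels are taken to be 0%, 25%, 50%, 75% and 100%; a model's "performance" at a level is taken to be the mean cosine similarity over all pairs; and the embeddings are random toy vectors rather than outputs of a real embedding model. The helper names (`cosine`, `average_ranks`, `spearman`, `mean_similarity_at_integrity`) are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def mean_similarity_at_integrity(left, right, integrity, rng):
    """Keep a fraction `integrity` of pairs matched and re-pair the rest.

    Assumption: mismatching is done by rotating partners among a randomly
    chosen subset of pairs, so each chosen item gets another item's partner.
    """
    n = len(left)
    n_mismatched = n - round(n * integrity)
    partner = list(range(n))
    broken = rng.sample(range(n), n_mismatched)
    for pos, i in enumerate(broken):
        partner[i] = broken[(pos + 1) % len(broken)]
    return sum(cosine(left[i], right[partner[i]]) for i in range(n)) / n


# Toy stand-in for embeddings: each "right" vector is a noisy copy of its
# matched "left" vector, so matched pairs are similar and mismatched are not.
rng = random.Random(0)
n_pairs, dim = 200, 32
left = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
right = [[x + rng.gauss(0, 0.1) for x in vec] for vec in left]

levels = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed spacing of the five levels
scores = [mean_similarity_at_integrity(left, right, lv, rng) for lv in levels]
rho = spearman(levels, scores)

# Comparison count stated in the abstract: models x tasks x pairs x levels.
total_comparisons = 30 * 7 * 2000 * 5
```

In this reading, a model whose similarity scores track the share of correctly matched pairs receives a Spearman score near 1, while a model that cannot tell matched from mismatched pairs receives a score near 0. The final line reproduces the abstract's count of 2.1 million comparisons.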

A Framework for Evaluating the Efficacy of Foundation Embedding Models in Healthcare

When is a Foundation Model a Foundation Model

Foundation AI Model for Medical Image Segmentation

Towards Scalable Foundation Models for Digital Dermatology

On the Challenges and Perspectives of Foundation Models for Medical Image Analysis

The Promises and Perils of Foundation Models in Dermatology

The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs

Foundation Models in Radiology: What, How, When, Why and Why Not

A Comprehensive Survey of Foundation Models in Medicine

A Clinical Benchmark of Public Self-Supervised Pathology Foundation Models

Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: Applications, Challenges, and Future Directions

A General-Purpose Multimodal Foundation Model for Dermatology

Exploring Foundation Models for Synthetic Medical Imaging: A Study on Chest X-Rays and Fine-Tuning Techniques

Foundation model for cancer imaging biomarkers

MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification

Are Natural Domain Foundation Models Useful for Medical Image Classification?

The shaky foundations of large language models and foundation models for electronic health records

Fostering transparent medical image AI via an image-text foundation model grounded in medical literature

A multi-center study on the adaptability of a shared foundation model for electronic health records

How Good Are We? Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment

A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks