Abstract: Text embeddings convert textual information into numerical representations, enabling machines to perform semantic tasks such as information retrieval. Despite their potential, text embeddings remain underexplored in healthcare, in part because few benchmarking studies use biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected thirty embedding models of various parameter sizes and architectures from the Massive Text Embedding Benchmark (MTEB) Hugging Face resource. Models were tested on real-world semantic retrieval medical tasks using (1) PubMed abstracts, (2) synthetic Electronic Health Records (EHRs) generated by the Llama-3-70b model, (3) real-world patient data from the Mount Sinai Health System, and (4) the MIMIC-IV database. Tasks were split into Short Tasks, involving brief text pair interactions such as triage notes and chief complaints, and Long Tasks, which required processing extended documentation such as progress notes and history & physical notes. We assessed models by correlating their performance with data integrity levels, ranging from 0% (fully mismatched pairs) to 100% (perfectly matched pairs), using Spearman correlation. Additionally, we examined correlations between the average Spearman scores across tasks and two MTEB leaderboard benchmarks: the overall recorded average and the average Semantic Textual Similarity (STS) score. We evaluated the 30 embedding models on seven clinical tasks (2,000 text pairs each) at five levels of data integrity, for a total of 2.1 million comparisons. Some models performed consistently well, while models based on Mistral-7b excelled in long-context tasks. NV-Embed-v1, despite being the top performer in short tasks, did not perform as well in long tasks.
Our average task performance score (ATPS) correlated better with the MTEB STS score (0.73) than with the MTEB average score (0.67). The proposed framework is flexible, scalable, and resistant to the risk of models overfitting on published benchmarks. Adopting this method can improve embedding technologies in healthcare.
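The evaluation idea described in the abstract (scoring a model by how well its similarity scores track the fraction of correctly matched text pairs) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the five integrity levels are taken to be 0%, 25%, 50%, 75%, and 100%; the per-level score is taken to be the mean cosine similarity; and the embeddings are synthetic toy vectors produced by the hypothetical helper `toy_embeddings` rather than by a real embedding model.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def mean_similarity(left, right, integrity, rng):
    """Mean cosine similarity when a fraction `integrity` of pairs stays matched."""
    n = len(left)
    idx = list(range(n))
    rng.shuffle(idx)
    matched = set(idx[: round(integrity * n)])
    total = 0.0
    for i in range(n):
        # A mismatched pair takes a partner from a different index (never i itself).
        j = i if i in matched else (i + rng.randrange(1, n)) % n
        total += cosine(left[i], right[j])
    return total / n


def integrity_correlation(left, right, levels=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Spearman correlation between integrity levels and mean pair similarity."""
    rng = random.Random(seed)
    scores = [mean_similarity(left, right, level, rng) for level in levels]
    return spearman(list(levels), scores)


def toy_embeddings(n=200, dim=16, noise=0.1, seed=1):
    """Hypothetical stand-in for a model: each right vector is a noisy copy of its left vector."""
    rng = random.Random(seed)
    left = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    right = [[x + rng.gauss(0, noise) for x in vec] for vec in left]
    return left, right


if __name__ == "__main__":
    left, right = toy_embeddings()
    print("Spearman vs. integrity:", integrity_correlation(left, right))
    # Comparison count stated in the abstract: models x tasks x pairs x levels.
    print("Total comparisons:", 30 * 7 * 2000 * 5)
```

In this toy setup, a model whose embeddings separate matched from mismatched pairs should yield a correlation near 1, while a model that cannot tell them apart should yield a value near 0. In practice, `toy_embeddings` would be replaced by the vectors from the embedding model under test.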

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding

ScandEval: A Benchmark for Scandinavian Natural Language Processing

MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis

A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks

PL-MTEB: Polish Massive Text Embedding Benchmark

Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

German Text Embedding Clustering Benchmark

Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks

Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

SWEb: A Large Web Dataset for the Scandinavian Languages

Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Evaluating Large Language Models with Human Feedback: Establishing a Swedish Benchmark

Benchmarking pre-trained text embedding models in aligning built asset information

Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks

Evaluation Benchmarks for Spanish Sentence Representations