Abstract:

Objective: To compare and release the diagnosis (ICD-10-CM), procedure (ICD-10-PCS), and medication (NDC) concept (code) embeddings trained by Latent Dirichlet Allocation (LDA), Word2Vec, GloVe, and BERT, for more efficient electronic health record (EHR) data analysis.

Materials and Methods: The embeddings were pre-trained separately by each of the four models using the diagnosis, procedure, and medication information in MIMIC-IV. We interpreted the embeddings by visualizing them in 2D space and used the silhouette coefficient to assess their clustering ability. Furthermore, we evaluated the embeddings without fine-tuning in three downstream tasks: next-visit diagnosis prediction, ICU patient mortality prediction, and medication recommendation.

Results: Embeddings pre-trained by GloVe had the best performance in the downstream tasks and the best interpretability across diagnosis, procedure, and medication codes. In next-visit diagnosis prediction, the accuracy obtained with GloVe embeddings was 12.2% higher than the baseline, which is the random generator. In the other two prediction tasks, GloVe improved the accuracy by 2%-3% over the baseline. LDA, Word2Vec, and BERT improved the results only marginally over the baseline in most cases.

Discussion and Conclusion: GloVe is superior to LDA, Word2Vec, and BERT in mining the diagnosis, procedure, and medication information of MIMIC-IV. We also found that the granularity of the training samples can affect model performance, depending on the downstream task and the pre-training data.

### Competing Interest Statement

The authors have declared no competing interest.

### Funding Statement

This study did not receive any funding.

### Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China gave ethical approval for this work.

Ethics committee/IRB of Oxford Suzhou Centre for Advanced Research, Suzhou, China gave ethical approval for this work.

Ethics committee/IRB of Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford OX1 2JD, UK gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

All data produced in the present study are available upon reasonable request to the authors.

<https://bit.ly/3ONj9Su>
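The silhouette coefficient that the abstract uses to assess clustering can be illustrated with a minimal sketch. This is not the authors' code: the grouping labels (hypothetical disease categories), the toy 2D vectors, the Euclidean distance, and the function names are all assumptions made for illustration. For each code, `a` is the mean distance to other codes in the same group, `b` is the lowest mean distance to any other group, and the score is `(b - a) / max(a, b)`.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def silhouette_coefficient(vectors, labels):
    """Mean silhouette score over all points.

    Points that are alone in their cluster contribute 0, following the
    usual convention.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2D "embeddings" with hypothetical category labels (illustrative only).
toy_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = ["circulatory", "circulatory", "respiratory", "respiratory"]
mixed_labels = ["circulatory", "respiratory", "circulatory", "respiratory"]

good_score = silhouette_coefficient(toy_vectors, good_labels)
mixed_score = silhouette_coefficient(toy_vectors, mixed_labels)
print(round(good_score, 3), round(mixed_score, 3))
```

Scores near 1 indicate that codes sit close to their own group and far from others; scores near or below 0 indicate that the grouping is not reflected in the embedding space. In practice a library routine such as scikit-learn's `silhouette_score` would typically replace this hand-written version.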

A Comparison of Representation Learning Methods for Medical Concepts in MIMIC-IV