Abstract:

Objective: To compare and release the diagnosis (ICD-10-CM), procedure (ICD-10-PCS), and medication (NDC) concept (code) embeddings trained by Latent Dirichlet Allocation (LDA), Word2Vec, GloVe, and BERT, for more efficient electronic health record (EHR) data analysis.

Materials and Methods: The embeddings were pre-trained by the four aforementioned models separately using the diagnosis, procedure, and medication information in MIMIC-IV. We interpreted the embeddings by visualizing them in 2D space and used the silhouette coefficient to assess the clustering ability of these embeddings. Furthermore, we evaluated the embeddings in three downstream tasks without fine-tuning: next-visit diagnosis prediction, ICU patient mortality prediction, and medication recommendation.

Results: We found that embeddings pre-trained by GloVe have the best performance in the downstream tasks and the best interpretability for all diagnosis, procedure, and medication codes. In next-visit diagnosis prediction, the accuracy of using GloVe embeddings was 12.2% higher than the baseline, a random generator. In the other two prediction tasks, GloVe improved the accuracy by 2%-3% over the baseline. LDA, Word2Vec, and BERT marginally improved the results over the baseline in most cases.

Discussion and Conclusion: GloVe shows superiority in mining the diagnosis, procedure, and medication information of MIMIC-IV compared with LDA, Word2Vec, and BERT. We also found that the granularity of training samples can affect the performance of models, depending on the downstream task and the pre-training data.

### Competing Interest Statement

The authors have declared no competing interest.

### Funding Statement

This study did not receive any funding.

### Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China gave ethical approval for this work.

Ethics committee/IRB of Oxford Suzhou Centre for Advanced Research, Suzhou, China gave ethical approval for this work.

Ethics committee/IRB of Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford OX1 2JD, UK gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

All data produced in the present study are available upon reasonable request to the authors <https://bit.ly/3ONj9Su>
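The silhouette-based assessment mentioned in the Materials and Methods can be illustrated with a minimal sketch. This is not the authors' code: the 2D vectors below are invented toy values, the function name `silhouette_score` is our own, and grouping codes by their leading letter (as a stand-in for the ICD-10-CM chapter) is an assumption, since the abstract does not state which cluster labels were used.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 2D embeddings for four ICD-10-CM codes (values are made up).
embeddings = {
    "I10":   (0.0, 0.0),
    "I11.0": (0.0, 1.0),
    "J18.9": (4.0, 0.0),
    "J44.9": (4.0, 1.0),
}
codes = list(embeddings)
vectors = [embeddings[c] for c in codes]

# Assumed labelling: leading letter of the code as a proxy for its chapter.
chapter_labels = [c[0] for c in codes]
score = silhouette_score(vectors, chapter_labels)

# A labelling that mixes the two groups, for contrast.
mixed_labels = ["A", "B", "A", "B"]
mixed_score = silhouette_score(vectors, mixed_labels)
```

A score near 1 indicates that codes sharing a label sit close together and far from other groups, while a score near or below 0 indicates overlapping groups. In this toy setup the chapter-based labelling scores well above the mixed one.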

Enhancing Automated Medical Coding: Evaluating Embedding Models for ICD-10-CM Code Mapping

Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study

A Comparison of Representation Learning Methods for Medical Concepts in MIMIC-IV

Improving ICD coding using Chapter based Named Entities and Attentional Models

Modelling long medical documents and code associations for explainable automatic ICD coding

Comparison of different feature extraction methods for applicable automated ICD coding

Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches

Towards BERT-based Automatic ICD Coding: Limitations and Opportunities

A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks

Accurate and Well-Calibrated ICD Code Assignment Through Attention Over Diverse Label Embeddings

Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

What Kind of Transformer Models to Use for the ICD-10 Codes Classification Task

Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models

Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

Generalist embedding models are better at short-context clinical semantic search than specialized embedding models

TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding

Automated ICD coding using extreme multi-label long text transformer-based models

Automatic ICD-10 coding: Deep semantic matching based on analogical reasoning

Benchmarking Large Language Models for Extraction of International Classification of Diseases Codes from Clinical Documentation

Large Language Model in Medical Informatics: Direct Classification and Enhanced Text Representations for Automatic ICD Coding

A Label Attention Model for ICD Coding from Clinical Text