A Comparison of Representation Learning Methods for Medical Concepts in MIMIC-IV
Xuan Wu, Yizheng Zhao, Yang Yang, Zhangdaihong Liu, David A. Clifton
DOI: https://doi.org/10.1101/2022.08.21.22278835
2022-01-01
Abstract:
Objective: To compare and release diagnosis (ICD-10-CM), procedure (ICD-10-PCS), and medication (NDC) concept (code) embeddings trained with Latent Dirichlet Allocation (LDA), Word2Vec, GloVe, and BERT, in order to support more efficient electronic health record (EHR) data analysis.
Materials and Methods: The embeddings were pre-trained with each of the four models separately, using the diagnosis, procedure, and medication information in MIMIC-IV. We interpreted the embeddings by visualizing them in 2D space and used the silhouette coefficient to assess how well they cluster. We then evaluated the embeddings, without fine-tuning, on three downstream tasks: next-visit diagnosis prediction, ICU patient mortality prediction, and medication recommendation.
Results: Embeddings pre-trained with GloVe performed best in the downstream tasks and were the most interpretable across diagnosis, procedure, and medication codes. In next-visit diagnosis prediction, accuracy with GloVe embeddings was 12.2% higher than the baseline, a random generator. In the other two prediction tasks, GloVe improved accuracy by 2%-3% over the baseline. LDA, Word2Vec, and BERT improved only marginally on the baseline in most cases.
Discussion and Conclusion: GloVe outperforms LDA, Word2Vec, and BERT in mining the diagnosis, procedure, and medication information in MIMIC-IV. We also found that the granularity of the training samples can affect model performance, depending on the downstream task and the pre-training data.
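The silhouette coefficient named in Materials and Methods can be illustrated with a small sketch. This is not the authors' code: the function names, the toy 2D points, and the group labels "A" and "B" (standing in for something like code categories) are all assumptions made for illustration, and Euclidean distance is assumed as the metric. For each point, `a` is the mean distance to the other members of its own group, `b` is the lowest mean distance to any other group, and the point's score is `(b - a) / max(a, b)`.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    # Group point indices by label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two groups")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention, a point alone in its group scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same group.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the points of any other group.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2D "embeddings" (hypothetical values, not taken from MIMIC-IV).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

# Labels that match the geometry give a score close to 1.
well_separated = silhouette_score(toy_points, ["A", "A", "B", "B"])

# Labels that cut across the geometry give a negative score.
mixed = silhouette_score(toy_points, ["A", "B", "A", "B"])

print(f"well separated: {well_separated:.3f}")
print(f"mixed labels:   {mixed:.3f}")
```

A score near 1 means codes with the same label sit close together and far from other groups, which is the sense in which the coefficient measures clustering ability. Values near 0 or below suggest that the embedding space does not separate the labeled groups.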
### Competing Interest Statement
The authors have declared no competing interest.
### Funding Statement
This study did not receive any funding.
### Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethics committee/IRB of National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China gave ethical approval for this work. Ethics committee/IRB of Oxford Suzhou Centre for Advanced Research, Suzhou, China gave ethical approval for this work. Ethics committee/IRB of Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford OX1 2JD, UK gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
All data produced in the present study are available upon reasonable request to the authors.
<https://bit.ly/3ONj9Su>