Classifying medical notes into standard disease codes using Machine Learning

Amitabha Karmakar
DOI: https://doi.org/10.48550/arXiv.1802.00382
2018-02-02
Abstract: We investigate the automatic classification of patient discharge notes into standard disease labels. We find that Convolutional Neural Networks with Attention outperform previous algorithms used in this task, and suggest further areas for improvement.
Machine Learning, Computation and Language, Applications
### What problem does this paper attempt to solve?

This paper addresses the automatic classification of patient discharge summaries into standard disease codes (ICD-9 codes). Specifically, the author uses electronic health records (EHRs) from the MIMIC III database and evaluates deep-learning models such as Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and attention mechanisms, with the goal of improving the accuracy and efficiency of automatic classification.

#### Background and Motivation

1. **Growth of Electronic Health Records (EHRs)**: EHRs now hold a large amount of patient information, including structured data (such as admission dates) and unstructured data (such as doctors' notes). These records contain valuable information that can be used for faster epidemic detection, symptom identification, personalized treatment, and similar purposes.
2. **Problems with Manual ICD Code Labeling**: Since 1967, the World Health Organization (WHO) has developed the International Classification of Diseases (ICD) system for monitoring the incidence and prevalence of diseases, observing reimbursement and resource allocation trends, and tracking safety and quality guidelines. Currently, ICD labels are assigned manually according to the code definitions, a process that is open to interpretation and prone to errors.
3. **Automation Requirement**: To improve the automation and accuracy of disease reporting, researchers have begun to explore methods for assigning ICD codes automatically.

#### Research Objectives

1. **Automatically Classify Discharge Summaries**: Use the data in the MIMIC III database to automatically classify discharge summaries into ICD-9 codes.
2. **Improve Existing Methods**: Evaluate the performance of different deep-learning models (such as CNN, LSTM, and attention) on this task and suggest improvements.

#### Method Overview

1. **Dataset**: Use 53,000 discharge summaries of 41,000 patients in the MIMIC III database.
2. **Pre-processing**: Standardize the text, including converting to lowercase, removing special characters, and word segmentation.
3. **Model Selection**:
   - **CNN**: Suitable for capturing local features, but has limited memory for long texts.
   - **LSTM**: Able to process sequential data, but has many parameters and may overfit.
   - **Attention**: Helps the model focus on the important parts of the input and is especially suitable for long-text classification.

#### Main Contributions

1. **Performance Improvement**: The CNN model with an attention mechanism significantly outperforms the other models, reaching an F1-score of 72.8%.
2. **Performance on Large-Scale Datasets**: On the complete set of 52,600 records, the pure CNN model reaches an F1-score of 79.7%, exceeding previous work.
3. **Future Directions**: The paper proposes further research directions, such as optimizing the CNN model, improving the embedding layer to suit clinical notes, and gradually increasing the number of ICD codes.

In conclusion, this paper aims to achieve more efficient and accurate automatic annotation of ICD codes through deep-learning techniques, thereby raising the level of automation in medical record processing.
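
The pre-processing step listed under Method Overview (lowercasing, removing special characters, word segmentation) can be illustrated with a short Python sketch. This is not the paper's code: the summary gives no implementation details, so the helper name `preprocess_note`, the choice to keep only letters and digits, and the whitespace-based splitting are all assumptions made for illustration.

```python
import re
from typing import List

# Assumption: "special characters" means anything other than lowercase letters,
# digits, and whitespace. The summary does not define the exact character set.
_SPECIAL_CHARS = re.compile(r"[^a-z0-9\s]")


def preprocess_note(text: str) -> List[str]:
    """Hypothetical helper: normalize one clinical note into a list of tokens."""
    lowered = text.lower()                      # convert to lowercase
    cleaned = _SPECIAL_CHARS.sub(" ", lowered)  # replace special characters with spaces
    return cleaned.split()                      # split on runs of whitespace


if __name__ == "__main__":
    print(preprocess_note("Pt. admitted w/ CHF; BP 120/80."))
    # ['pt', 'admitted', 'w', 'chf', 'bp', '120', '80']
```

Replacing special characters with spaces, rather than deleting them, keeps values such as `120/80` from merging into a single token. A real pipeline for clinical text might instead preserve such patterns or use a dedicated tokenizer.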