Abstract:

Background: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to capture more clinical detail through a larger set of diagnosis and procedure codes, and they are applied in diagnosis-related groups for reimbursement. The expansion of the code set has made coding more time-consuming and less accurate. State-of-the-art models for automatic multilabel text classification of ICD-10 use deep contextual word embeddings. Beyond discharge diagnoses (DD) as input, performance can be improved by appropriately preprocessing text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination.

Objective: This study aims to build an ICD-10 multilabel classification model that combines a contextual language model with rule-based preprocessing methods.

Methods: We retrieved electronic health records from a medical center. We first compared different word embedding methods and then compared the preprocessing methods using the best-performing embeddings. To predict ICD-10-CM codes, we compared biomedical bidirectional encoder representations from transformers (BioBERT), clinical generalized autoregressive pretraining for language understanding (Clinical XLNet), label tree-based attention-aware deep model for high-performance extreme multilabel text classification (AttentionXML), and word-to-vector (Word2Vec). To compare preprocessing methods for ICD-10-CM, we included DD, medical history, and comorbidity and complication as inputs, and we evaluated ICD-10-CM prediction under different preprocessing steps: definition training, external cause code removal, number conversion, and combination code filtering. For ICD-10-PCS, the model was trained using different combinations of DD, surgical method, and keywords of special examination. The micro F1 score and the micro area under the receiver operating characteristic curve were used to compare model performance across preprocessing methods.

Results: BioBERT had an F1 score of 0.701 and outperformed the other models (Clinical XLNet, AttentionXML, and Word2Vec). For ICD-10-CM, the F1 score increased significantly from 0.749 (95% CI 0.744-0.753) to 0.769 (95% CI 0.764-0.773) with ICD-10 definition training, external cause code removal, number conversion, and combination code filtering. For ICD-10-PCS, the F1 score increased significantly from 0.670 (95% CI 0.663-0.678) to 0.726 (95% CI 0.719-0.732) with a combination of discharge diagnoses, surgical methods, and keywords of special examination. With our preprocessing methods, the model achieved its highest area under the receiver operating characteristic curve: 0.853 (95% CI 0.849-0.855) for ICD-10-CM and 0.831 (95% CI 0.827-0.834) for ICD-10-PCS.

Conclusions: Our model, which combines a pretrained contextualized language model with rule-based preprocessing, performs better than the state-of-the-art model for both ICD-10-CM and ICD-10-PCS. This study highlights the importance of rule-based preprocessing methods derived from the rules that coders follow.
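As an illustration of the evaluation described above, the following is a minimal sketch of a micro-averaged F1 score for multilabel code prediction, together with one possible form of external cause code removal. It is not the authors' implementation. The function names, the toy records, and the rule that external cause codes are those beginning with V, W, X, or Y (ICD-10-CM chapter 20, V00-Y99) are assumptions made for this example.

```python
from typing import Iterable, List, Set

# Assumption: external cause codes are identified by their first letter,
# following ICD-10-CM chapter 20 (V00-Y99). The study's exact rule may differ.
EXTERNAL_CAUSE_PREFIXES = {"V", "W", "X", "Y"}


def drop_external_cause_codes(codes: Iterable[str]) -> Set[str]:
    """Return the code set without external cause codes (hypothetical helper)."""
    return {c for c in codes if c[:1].upper() not in EXTERNAL_CAUSE_PREFIXES}


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all records, then compute F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # codes predicted and present in the gold set
        fp += len(p - g)   # codes predicted but absent from the gold set
        fn += len(g - p)   # gold codes the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy records (invented for illustration, not data from the study).
gold = [{"I10", "E11.9", "W19.XXXA"}, {"J18.9"}]
pred = [{"I10", "E11.9"}, {"J18.9", "I10"}]

score_raw = micro_f1(gold, pred)
gold_filtered = [drop_external_cause_codes(g) for g in gold]
score_filtered = micro_f1(gold_filtered, pred)
print(f"raw: {score_raw:.3f}, external causes removed: {score_filtered:.3f}")
```

In this toy case, removing the external cause code from the gold labels eliminates one false negative, so the micro F1 score rises. The sketch only shows the mechanics and says nothing about the size of the effect on real data.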

Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model With Rule-Based Approaches