Abstract: Clinical notes are assigned ICD codes, which are sets of codes for diagnoses and procedures. In recent years, predictive machine learning models have been built for automatic ICD coding. However, there is a lack of widely accepted benchmarks for automated ICD coding models based on large-scale public EHR data. This paper proposes a public benchmark suite for ICD-10 coding using a large EHR dataset derived from MIMIC-IV, the most recent public EHR dataset. We implement and compare several popular methods for ICD coding prediction tasks to standardize data preprocessing and establish a comprehensive ICD coding benchmark dataset. This approach fosters reproducibility and model comparison, accelerating progress toward employing automated ICD coding in future studies. Furthermore, we create a new ICD-9 benchmark using MIMIC-IV data, providing more data points and a higher number of ICD codes than MIMIC-III. Our open-source code offers easy access to data processing steps, benchmark creation, and experiment replication for those with MIMIC-IV access, providing insights, guidance, and protocols to efficiently develop ICD coding models.

What problem does this paper attempt to address?

The paper addresses the lack of a widely accepted benchmark for automated ICD coding, especially for models trained on large-scale public electronic health record (EHR) datasets. Most existing benchmarks are based on the MIMIC-III dataset, which contains only about 9,000 ICD codes, far fewer than the number of codes used in practice. In addition, most existing benchmarks focus on ICD-9 coding, while ICD-10 coding lacks a widely recognized benchmark.

To close these gaps, the paper proposes a new set of public benchmarks built on MIMIC-IV, the latest public EHR dataset, which contains a decade of intensive care data collected from 2008 to 2019. MIMIC-IV not only contains more documents and more unique ICD codes than MIMIC-III, but also covers both ICD-9 and ICD-10 coding. By implementing and comparing several popular ICD coding prediction methods, the paper aims to standardize data preprocessing and establish a comprehensive ICD coding benchmark dataset, thereby promoting reproducible results and fair model comparison, and accelerating the use of automated ICD coding in future research.

### Main contributions

1. **New benchmark dataset**: The paper proposes a new public benchmark suite for ICD-10 coding, based on the MIMIC-IV dataset.
2. **Standardization of data preprocessing**: Standardizes the data preprocessing pipeline so that different methods are compared under the same conditions.
3. **Extensive model comparison**: Implements and compares multiple popular methods and evaluates their performance on large-scale multi-label classification tasks.
4. **Open-source code**: Provides open-source code that gives researchers access to the data processing steps and lets them generate the benchmarks and reproduce the experiments.

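
ICD coding is a multi-label task: each document carries a set of codes, and a model predicts a set of codes in return. The summary does not say which metrics the paper reports, so the sketch below uses micro- and macro-averaged F1 purely as an illustration of how such predictions are commonly scored. The function names and the toy codes are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Score multi-label predictions.

    gold, pred: lists of sets of ICD code strings, one set per document.
    Returns (micro_f1, macro_f1).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:   # predicted and correct
            tp[code] += 1
        for code in p - g:   # predicted but not in the gold set
            fp[code] += 1
        for code in g - p:   # in the gold set but missed
            fn[code] += 1

    # Micro: pool counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

    # Macro: average per-code F1, so rare codes weigh as much as common ones.
    labels = set(tp) | set(fp) | set(fn)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy example with made-up code labels.
gold = [{"A", "B"}, {"A"}]
pred = [{"A"}, {"A", "C"}]
micro, macro = micro_macro_f1(gold, pred)
```

With tens of thousands of distinct codes, the two averages can diverge sharply: a model that handles only the frequent codes well can score a high micro F1 and a low macro F1.
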
### Dataset statistics

- **MIMIC-IV-ICD9**:
  - Number of documents: 209,359
  - Average number of words per document: 1,460
  - Average number of ICD codes per document: 13.4
  - Total number of unique ICD codes: 11,331
- **MIMIC-IV-ICD10**:
  - Number of documents: 122,317
  - Average number of words per document: 1,662
  - Average number of ICD codes per document: 16.1
  - Total number of unique ICD codes: 26,096

### Experimental results

The paper runs experiments on two datasets, MIMIC-IV-ICD9 and MIMIC-IV-ICD10, to evaluate several baseline models. The main models are:

- **CAML**: Convolutional Attention Network
- **LAAT**: Label Attention Model
- **JointLAAT**: Hierarchical Joint Learning Model
- **MSMN**: Multi-Synonym Matching Network
- **PLM-ICD**: ICD coding based on a pre-trained language model

In the MIMIC-IV-ICD9-Full setting, PLM-ICD performs best, while in the MIMIC-IV-ICD9-50 setting, MSMN performs best. In the MIMIC-IV-ICD10-Full setting, LAAT performs best among models that do not rely on external data or knowledge, while PLM-ICD performs best among models that do.

### Conclusion

By proposing a new benchmark dataset and a standardized data preprocessing pipeline, the paper provides a resource and reference point for future research, which should help accelerate the development and application of automated ICD coding.
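
As a rough illustration of where figures like those under "Dataset statistics" come from, the sketch below computes the same four quantities on a toy corpus and derives a reduced label set. It assumes the "50" setting keeps only the 50 most frequent codes, as is conventional in earlier MIMIC-III benchmarks; the summary itself does not define it. The function names, the whitespace tokenization, and the sample records are all hypothetical.

```python
from collections import Counter


def dataset_stats(docs):
    """docs: list of (text, codes) pairs, where codes is a list of ICD code strings."""
    n = len(docs)
    total_words = sum(len(text.split()) for text, _ in docs)  # whitespace tokens
    total_codes = sum(len(set(codes)) for _, codes in docs)
    unique = {c for _, codes in docs for c in codes}
    return {
        "documents": n,
        "avg_words": total_words / n if n else 0.0,
        "avg_codes": total_codes / n if n else 0.0,
        "unique_codes": len(unique),
    }


def top_k_subset(docs, k=50):
    """Keep only the k most frequent codes; drop documents left with no codes."""
    freq = Counter(c for _, codes in docs for c in set(codes))
    keep = {c for c, _ in freq.most_common(k)}
    subset = []
    for text, codes in docs:
        kept = [c for c in codes if c in keep]
        if kept:
            subset.append((text, kept))
    return subset


# Toy records, not real MIMIC-IV data.
docs = [
    ("pt has chest pain", ["I20.9", "I10"]),
    ("follow up visit", ["I10"]),
    ("fracture of arm", ["S52.5"]),
]
stats = dataset_stats(docs)
small = top_k_subset(docs, k=1)
```

Restricting the label set this way shrinks the problem from thousands of codes to a few dozen, which may help explain why the best model differs between the Full and 50 settings.
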

Mimic-IV-ICD: A new benchmark for eXtreme MultiLabel Classification
