Mimic-IV-ICD: A new benchmark for eXtreme MultiLabel Classification

Thanh-Tung Nguyen, Viktor Schlegel, Abhinav Kashyap, Stefan Winkler, Shao-Syuan Huang, Jie-Jyun Liu, Chih-Jen Lin
2023-04-27
Abstract: Clinical notes are assigned ICD codes - sets of codes for diagnoses and procedures. In recent years, predictive machine learning models have been built for automatic ICD coding. However, there is a lack of widely accepted benchmarks for automated ICD coding models based on large-scale public EHR data. This paper proposes a public benchmark suite for ICD-10 coding using a large EHR dataset derived from MIMIC-IV, the most recent public EHR dataset. We implement and compare several popular methods for ICD coding prediction tasks to standardize data preprocessing and establish a comprehensive ICD coding benchmark dataset. This approach fosters reproducibility and model comparison, accelerating progress toward employing automated ICD coding in future studies. Furthermore, we create a new ICD-9 benchmark using MIMIC-IV data, providing more data points and a higher number of ICD codes than MIMIC-III. Our open-source code offers easy access to data processing steps, benchmark creation, and experiment replication for those with MIMIC-IV access, providing insights, guidance, and protocols to efficiently develop ICD coding models.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the lack of a widely accepted benchmark for automated ICD coding, particularly for models trained on large-scale public electronic health record (EHR) data. Most existing benchmarks are built on the MIMIC-III dataset, which contains only about 9,000 ICD codes, far fewer than are used in practice. They also focus mostly on ICD-9, leaving ICD-10 coding without a widely recognized benchmark.

To close these gaps, the paper proposes a new set of public benchmarks built on MIMIC-IV, the latest public EHR dataset, which covers intensive care data from 2008 to 2019. MIMIC-IV contains more documents and more unique ICD codes than MIMIC-III, and it covers both ICD-9 and ICD-10 coding. By implementing and comparing several popular ICD coding methods, the paper aims to standardize data preprocessing and establish a comprehensive ICD coding benchmark dataset. This supports reproducibility and model comparison, and is intended to accelerate the use of automated ICD coding in future research.

### Main contributions

1. **New benchmark dataset**: Proposes a new public benchmark suite for ICD-10 coding, based on the MIMIC-IV dataset.
2. **Standardized data preprocessing**: Standardizes data preprocessing so that different methods are compared under the same conditions.
3. **Extensive model comparison**: Implements and compares multiple popular methods and evaluates their performance on large-scale multi-label classification tasks.
4. **Open-source code**: Provides open-source code so that researchers can follow the data processing steps, generate the benchmarks, and reproduce the experiments.
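ICD coding is framed here as multi-label classification: each clinical note maps to a set of codes rather than a single class. The sketch below shows one way such label sets could be turned into multi-hot target vectors. It is an illustration, not the paper's preprocessing code: the function names `build_label_space` and `multi_hot`, the `top_k` parameter, and the toy code sets are all assumptions made for this example. Restricting the label space to the most frequent codes is a common convention in this literature, but whether it matches the paper's own label subsets is also an assumption.

```python
from collections import Counter


def build_label_space(code_sets, top_k=None):
    """Return a sorted list of codes (hypothetical helper).

    If top_k is given, keep only the top_k codes by document frequency.
    """
    # Count each code at most once per document.
    counts = Counter(code for codes in code_sets for code in set(codes))
    if top_k is None:
        kept = list(counts)
    else:
        # Rank by descending frequency; break ties by code for determinism.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = [code for code, _ in ranked[:top_k]]
    return sorted(kept)


def multi_hot(codes, label_space):
    """Encode one document's codes as a 0/1 vector over label_space."""
    present = set(codes)
    return [1 if code in present else 0 for code in label_space]


# Toy ICD-10 code sets for three imaginary notes (not taken from MIMIC-IV).
toy_code_sets = [
    ["I10", "E11.9"],
    ["I10", "N17.9", "J18.9"],
    ["I10", "E11.9", "Z79.4"],
]

full_space = build_label_space(toy_code_sets)
top2_space = build_label_space(toy_code_sets, top_k=2)

full_vectors = [multi_hot(codes, full_space) for codes in toy_code_sets]
top2_vectors = [multi_hot(codes, top2_space) for codes in toy_code_sets]

print(full_space)
print(full_vectors)
print(top2_space)
print(top2_vectors)
```

With tens of thousands of unique codes, a real pipeline would more likely store these targets as sparse matrices than as dense Python lists; the dense form is used here only to keep the example readable.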
### Dataset statistics

- **MIMIC-IV-ICD9**:
  - Number of documents: 209,359
  - Average number of words per document: 1,460
  - Average number of ICD codes per document: 13.4
  - Total number of unique ICD codes: 11,331
- **MIMIC-IV-ICD10**:
  - Number of documents: 122,317
  - Average number of words per document: 1,662
  - Average number of ICD codes per document: 16.1
  - Total number of unique ICD codes: 26,096

### Experimental results

The paper runs experiments on two datasets, MIMIC-IV-ICD9 and MIMIC-IV-ICD10, to evaluate several baseline models:

- **CAML**: Convolutional Attention Network
- **LAAT**: Label Attention Model
- **JointLAAT**: Hierarchical Joint Learning Model
- **MSMN**: Multi-Synonym Matching Network
- **PLM-ICD**: ICD coding based on a pre-trained language model

In the MIMIC-IV-ICD9-Full setting, PLM-ICD performs best, while in the MIMIC-IV-ICD9-50 setting, MSMN performs best. In the MIMIC-IV-ICD10-Full setting, LAAT performs best among models that do not rely on external data or knowledge, and PLM-ICD performs best among models that do.

### Conclusion

By proposing a new benchmark dataset and a standardized data preprocessing pipeline, the paper provides a shared resource and reference point for future research, which should help accelerate the development and application of automated ICD coding.
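The four quantities reported under the dataset statistics above (document count, average words per document, average codes per document, and unique codes) can be computed from any collection of coded notes. The sketch below is a minimal illustration under stated assumptions: the record layout with `text` and `codes` keys, the function name `dataset_statistics`, whitespace tokenization, and the toy records are all hypothetical and do not reflect the paper's actual pipeline or the MIMIC-IV schema.

```python
def dataset_statistics(records):
    """Summarize a list of {"text": str, "codes": list[str]} records.

    The record layout is an assumption made for this example.
    """
    if not records:
        raise ValueError("records must not be empty")

    n_docs = len(records)
    # Whitespace tokenization is a simplification; real word counts
    # depend on the tokenizer used during preprocessing.
    total_words = sum(len(r["text"].split()) for r in records)
    # Count each code once per document.
    total_codes = sum(len(set(r["codes"])) for r in records)
    unique_codes = set()
    for r in records:
        unique_codes.update(r["codes"])

    return {
        "documents": n_docs,
        "avg_words_per_document": total_words / n_docs,
        "avg_codes_per_document": total_codes / n_docs,
        "unique_codes": len(unique_codes),
    }


# Toy records for illustration only (not taken from MIMIC-IV).
toy_records = [
    {"text": "patient admitted with chest pain", "codes": ["I10", "E11.9"]},
    {
        "text": "acute kidney injury on admission noted",
        "codes": ["N17.9", "I10", "J18.9"],
    },
]

stats = dataset_statistics(toy_records)
print(stats)
```

Computing these figures the same way for every dataset variant is one small piece of what standardized preprocessing buys: the numbers become directly comparable across benchmarks.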