MDACE: MIMIC Documents Annotated with Code Evidence

Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, Matthew R. Gormley
2023-07-08
Abstract: We introduce a dataset for evidence/rationale extraction on an extreme multi-label classification task over long medical documents. One such task is Computer-Assisted Coding (CAC) which has improved significantly in recent years, thanks to advances in machine learning technologies. Yet simply predicting a set of final codes for a patient encounter is insufficient as CAC systems are required to provide supporting textual evidence to justify the billing codes. A model able to produce accurate and reliable supporting evidence for each code would be a tremendous benefit. However, a human annotated code evidence corpus is extremely difficult to create because it requires specialized knowledge. In this paper, we introduce MDACE, the first publicly available code evidence dataset, which is built on a subset of the MIMIC-III clinical records. The dataset -- annotated by professional medical coders -- consists of 302 Inpatient charts with 3,934 evidence spans and 52 Profee charts with 5,563 evidence spans. We implemented several evidence extraction methods based on the EffectiveCAN model (Liu et al., 2021) to establish baseline performance on this dataset. MDACE can be used to evaluate code evidence extraction methods for CAC systems, as well as the accuracy and interpretability of deep learning models for multi-label classification. We believe that the release of MDACE will greatly improve the understanding and application of deep learning technologies for medical coding and document classification.
Computation and Language
What problem does this paper attempt to address?
The paper primarily addresses two core issues:

1. **Creating a publicly available code evidence dataset**: The researchers constructed MDACE (MIMIC Documents Annotated with Code Evidence), the first publicly available dataset for evaluating code evidence extraction methods in computer-assisted coding (CAC) systems. The dataset is built on a subset of MIMIC-III clinical records and is annotated by professional medical coders. It includes evidence spans for diagnosis and procedure codes, together with their text offsets in the corresponding clinical notes.
2. **Establishing benchmark performance**: To establish baseline performance on MDACE, the authors implemented several evidence extraction methods based on the EffectiveCAN model: unsupervised attention, supervised attention, a linear tagging layer, and a convolutional neural network (CNN) tagging layer. Comparing these methods shows how well different models extract evidence that supports the predicted codes.

In summary, the goal of this paper is to address the challenge of automatically extracting evidence supporting diagnosis and procedure codes from medical documents. To this end, the authors created a new dataset, MDACE, and provided preliminary experimental results on it to promote further research and development in this field.
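To make the idea of code evidence spans with text offsets concrete, the following is a minimal Python sketch of how such annotations might be represented and how predicted evidence could be scored against them. Everything here is an assumption for illustration: the `EvidenceSpan` class, the `span_for` and `span_prf` helpers, the example note, the code labels, and the overlap-based matching rule are hypothetical and do not reflect MDACE's actual file format or the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceSpan:
    """Hypothetical record: one piece of text evidence for one code."""
    code: str   # illustrative code label
    start: int  # character offset into the note, inclusive
    end: int    # character offset into the note, exclusive


def span_for(note: str, code: str, phrase: str) -> EvidenceSpan:
    """Build a span from the first occurrence of `phrase` in `note`."""
    start = note.index(phrase)
    return EvidenceSpan(code, start, start + len(phrase))


def overlaps(a: EvidenceSpan, b: EvidenceSpan) -> bool:
    """Spans match if they share a code and at least one character."""
    return a.code == b.code and a.start < b.end and b.start < a.end


def span_prf(gold: list, pred: list) -> tuple:
    """Overlap-based precision, recall, and F1 (an assumed matching rule)."""
    hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    hit_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up note text, not taken from MIMIC-III.
note = "Patient admitted with acute systolic heart failure and type 2 diabetes."

gold = [
    span_for(note, "428.21", "acute systolic heart failure"),
    span_for(note, "250.00", "type 2 diabetes"),
]
pred = [
    span_for(note, "428.21", "heart failure"),  # partial overlap with gold
    span_for(note, "250.00", "admitted"),       # right code, wrong text
]

precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

In this toy case one of the two predicted spans overlaps a gold span with the same code, and one of the two gold spans is recovered. A stricter variant would require exact offset equality instead of overlap; which rule is appropriate depends on how the evaluation is defined.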