Abstract:
Background: International Classification of Diseases codes are widely used to describe diagnosis information, but manual coding relies heavily on human interpretation, which can be expensive, time consuming, and prone to errors. With the transition from the International Classification of Diseases, Ninth Revision, to the International Classification of Diseases, Tenth Revision (ICD-10), the coding process has become more complex, highlighting the need for automated approaches to enhance coding efficiency and accuracy. Inaccurate coding can result in substantial financial losses for hospitals, so a precise assessment of the outcomes generated by a natural language processing (NLP)-driven autocoding system plays a critical role in safeguarding the accuracy of the Taiwan diagnosis related groups (Tw-DRGs).
Objective: This study aims to evaluate the feasibility of applying an International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) autocoding system that can automatically determine diagnoses and codes based on free-text discharge summaries to facilitate the assessment of Tw-DRGs, specifically principal diagnosis and major diagnostic categories (MDCs).
Methods: Using patient discharge summaries from Kaohsiung Medical University Chung-Ho Memorial Hospital (KMUCHH) from April 2019 to December 2020 as a reference data set, we developed artificial intelligence (AI)-assisted ICD-10-CM coding systems based on deep learning models. We constructed a web-based user interface for the AI-assisted coding system and deployed the system into the workflow of the certified coding specialists (CCSs) of KMUCHH. The data used for the assessment of Tw-DRGs were manually curated by a CCS, who determined the principal diagnosis and MDC from discharge summaries collected at KMUCHH from February 2023 to April 2023.
Results: Both the reference data set and real hospital data were used to assess performance in determining ICD-10-CM coding, principal diagnosis, and MDC for Tw-DRGs. Among all methods, the GPT-2 (OpenAI)-based model achieved the highest F1-score, 0.667 (F1-score 0.851 for the top 50 codes), on the KMUCHH test set and a slightly lower F1-score, 0.621, on real hospital data. Cohen κ evaluation of the agreement on MDC between the models and the CCS revealed that the overall average κ value for GPT-2 (κ=0.714) was 0.122 (approximately 12.2 percentage points) higher than that of the hierarchy attention network (κ=0.592). GPT-2 demonstrated superior agreement with the CCS across 6 categories of MDC, with an average κ value of approximately 0.869 (SD 0.033), underscoring the effectiveness of the developed AI-assisted coding system in supporting the work of CCSs.
Conclusions: An NLP-driven AI-assisted coding system can assist CCSs in ICD-10-CM coding by offering coding references via a user interface, demonstrating the potential to reduce the manual workload and expedite Tw-DRG assessment. Consistency in performance affirmed the effectiveness of the system in supporting CCSs in ICD-10-CM coding and the judgment of Tw-DRGs.
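
For readers unfamiliar with the agreement statistic reported in the Results, the following is a minimal sketch of how Cohen κ can be computed between a coding specialist and a model over MDC labels. It is not the authors' code: the function name `cohen_kappa` and the toy MDC labels are hypothetical, and the abstract does not state how κ was aggregated across MDC categories, so this shows only the plain multiclass form.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies,
    # summed over all labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up MDC labels for six discharge summaries (illustration only).
ccs_mdc = ["05", "05", "04", "06", "04", "05"]
model_mdc = ["05", "04", "04", "06", "04", "05"]
kappa = cohen_kappa(ccs_mdc, model_mdc)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 5 of 6 items (observed agreement 5/6) while chance agreement is 13/36, giving κ = 17/23, or about 0.739. κ discounts the agreement expected by chance, which is why it is lower than the raw agreement rate.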

Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset

Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

Mimic-IV-ICD: A new benchmark for eXtreme MultiLabel Classification

Natural language processing of MIMIC-III clinical notes for identifying diagnosis and procedures with neural networks

Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes

A Systematic Literature Review of Automated ICD Coding and Classification Systems using Discharge Summaries

An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records

The MIMIC Code Repository: enabling reproducibility in critical care research

A Multidatabase ExTRaction PipEline (METRE) for Facile Cross Validation in Critical Care Research

Modelling Conditions and Health Care Processes in Electronic Health Records: An Application to Severe Mental Illness with the Clinical Practice Research Datalink

Improving ICD coding using Chapter based Named Entities and Attentional Models

Benchmarking with MIMIC-IV, an irregular, spare clinical time series dataset

Exploring the Consistency, Quality and Challenges in Manual and Automated Coding of Free-text Diagnoses from Hospital Outpatient Letters

A Comparative Study on Automatic Coding of Medical Letters with Explainability

A Scalable Workflow to Build Machine Learning Classifiers with Clinician-in-the-Loop to Identify Patients in Specific Diseases

MIMIC-IV, a freely accessible electronic health record dataset

Evaluating a Natural Language Processing-Driven, AI-Assisted International Classification of Diseases, 10th Revision, Clinical Modification, Coding System for Diagnosis Related Groups in a Real Hospital Environment: Algorithm Development and Validation Study

Eliciting Motivational Interviewing Skill Codes in Psychotherapy with LLMs: A Bilingual Dataset and Analytical Study

A Survey of Automated ICD Coding: Development, Challenges, and Applications

MDACE: MIMIC Documents Annotated with Code Evidence
