Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

Eyal Klang, Idit Tessler, Donald U Apakama, Ethan Abbott, Benjamin S Glicksberg, Monique Arnold, Akini Moses, Ankit Sakhuja, Ali Soroush, Alexander W Charney, David L Reich, Jolion McGreevy, Nicholas Gavin, Brendan Carr, Robert Freeman, Girish N Nadkarni
DOI: https://doi.org/10.1101/2024.10.15.24315526
2024-10-17
Abstract:
Background: Accurate medical coding is essential for clinical and administrative purposes but is complicated, time-consuming, and prone to bias. This study compares Retrieval-Augmented Generation (RAG)-enhanced LLMs to provider-assigned codes in producing ICD-10-CM codes from emergency department (ED) clinical records.
Methods: Retrospective cohort study using 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated data from 1,038,066 past ED visits (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.
Findings: RAG-enhanced LLMs outperformed provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where GPT-4 and provider-assigned codes disagreed, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where the providers' codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 instances, whereas provider-assigned codes were preferred in only 181 instances (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also demonstrated substantial scalability when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes significantly improved following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.
Interpretation: RAG-enhanced LLMs improve medical coding accuracy in EDs, suggesting clinical workflow applications. These findings show that generative AI can improve clinical outcomes and reduce administrative burdens.
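The preference counts in the Findings can be sanity-checked with a simple sign test. The sketch below is an illustration only: the text above does not say which statistical test the authors used, the function name `sign_test_p` is hypothetical, and the test assumes every preference is an independent vote with ties dropped, which is a simplification because several reviewers rated the same cases.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial (sign) test against a 50/50 split, ties excluded."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # One-tail probability of a split at least as lopsided as the observed one
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


# Counts taken from the abstract: (LLM preferred, provider preferred)
reported = {
    "GPT-4 accuracy": (447, 277),
    "GPT-4 specificity": (509, 181),
    "Llama-3.1-70B accuracy": (218, 90),
}

for label, (llm_wins, provider_wins) in reported.items():
    share = llm_wins / (llm_wins + provider_wins)
    p = sign_test_p(llm_wins, provider_wins)
    print(f"{label}: LLM preferred in {share:.1%} of decided ratings, p = {p:.2e}")
```

Under these assumptions all three comparisons fall below the 0.001 threshold quoted in the abstract for the GPT-4 results.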
Health Informatics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper evaluates how well large language models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) generate International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes for Emergency Department (ED) visits, and compares their output with the codes manually assigned by physicians.

### Background and Motivation

1. **Importance of Medical Coding**: Accurate medical coding is crucial for clinical records, public health surveillance, academic research, and medical reimbursement.
2. **Existing Issues**: ICD coding, while important, is time-consuming and prone to errors, leading to incorrect or incomplete documentation.
3. **Automation Attempts**: Efforts to automate the extraction of medical codes using LLMs already exist, but these models often generate inaccurate or fictitious codes when only provided with code descriptions.
4. **Advantages of RAG Technology**: RAG improves the accuracy of LLMs by combining their generative capabilities with a retrieval component that draws on externally verifiable data.

### Research Objectives

1. **Evaluate the Performance of RAG-Enhanced LLMs in ED ICD-10-CM Coding**: Specifically, the study compares the accuracy and specificity of ICD-10-CM codes generated by RAG-enhanced LLMs with those manually assigned by physicians.
2. **Explore the Potential Applications of RAG Technology**: The study also explores how RAG could improve medical coding accuracy, reduce administrative burden, and enhance clinical documentation.

### Methods

1. **Study Design**: A retrospective cohort study analyzing 500 randomly selected ED visits from January to April 2024.
2. **Data Source**: Data were obtained from the electronic health record (EHR) system (EPIC system) of Mount Sinai Health System (MSHS).
3. **RAG Database**: ICD codes and descriptions from all adult ED visits between 2021 and 2023 (1,038,066 visits) form the database that the RAG system retrieves from.
4. **Experimental Steps**:
   - Initial Code Prediction Prompt: Nine different LLMs predict the primary ICD-10-CM diagnosis code and its description from the clinical notes.
   - Retrieval-Augmented Generation (RAG): The database is queried to identify the 10 most relevant ICD code descriptions.
   - Refined Prompt: The 10 retrieved code descriptions are resubmitted to each LLM together with the ED notes, and the LLM selects the most appropriate description and its associated code.
   - LLM Evaluation: Two large open-source LLMs (Llama-3.1-70B, Qwen-2-72B) independently compare the accuracy and specificity of all LLM-assigned and human-assigned codes.
   - Human Evaluation: Four reviewers (two attending emergency physicians and two emergency residents) assess the accuracy and specificity of GPT-4-generated codes versus the codes manually assigned by physicians.

### Results

1. **Accuracy and Specificity**: RAG-enhanced LLMs outperformed manual coding by physicians in both the accuracy and the specificity of code assignment.
2. **Specific Performance**:
   - Across 200 cases where GPT-4 and physicians disagreed, human reviewers preferred GPT-4's code for accuracy 447 times versus 277 times for the physician's code, and for specificity 509 times versus 181 times.
   - Smaller open-access models also improved with RAG: Llama-3.1-70B was preferred for accuracy 218 times versus 90 for physicians' codes.

### Discussion

1. **Advantages of RAG Technology**: RAG improves coding accuracy when clinical language is complex or documentation is ambiguous by retrieving relevant contextual information in real time.
2. **Potential Applications**: Beyond improving coding accuracy and specificity, RAG-enhanced LLMs could reduce administrative burden, enhance hospital operational efficiency, and improve patient outcomes.
3. **Limitations**: The model's performance may degrade over time, requiring continuous updates; ethical considerations and the risk of "black box" decision-making also need attention; the model's performance may vary across healthcare settings.
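
The three experimental steps (initial prediction, retrieval of candidate descriptions, refined prompt) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the summary does not specify the retrieval method or the prompts, so the bag-of-words cosine similarity, the prompt wording, the six-entry `CODE_DB`, the fallback to the top-ranked candidate, and the names `retrieve`, `code_with_rag`, and `stub_llm` are all hypothetical. A real system would call an actual LLM and search a far larger database.

```python
import re
from collections import Counter
from math import sqrt
from typing import Callable

# Hypothetical miniature stand-in for the database of ICD-10-CM code descriptions
CODE_DB = {
    "R07.9": "Chest pain, unspecified",
    "R07.89": "Other chest pain",
    "J18.9": "Pneumonia, unspecified organism",
    "S93.401A": "Sprain of unspecified ligament of right ankle, initial encounter",
    "R10.9": "Unspecified abdominal pain",
    "N39.0": "Urinary tract infection, site not specified",
}


def tokens(text: str) -> Counter:
    """Lowercased bag of words; a stand-in for whatever embedding a real system uses."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 10) -> list[tuple[str, str]]:
    """Return the k (code, description) pairs most similar to the query."""
    q = tokens(query)
    ranked = sorted(CODE_DB.items(), key=lambda item: cosine(q, tokens(item[1])), reverse=True)
    return ranked[:k]


def code_with_rag(note: str, llm: Callable[[str], str], k: int = 10) -> str:
    # Step 1: initial prediction from the clinical note alone
    first_guess = llm("Give the primary ICD-10-CM code and description for this ED note:\n" + note)
    # Step 2: use the guess to retrieve real code descriptions from the database
    candidates = retrieve(first_guess, k)
    menu = "\n".join(f"{code}: {desc}" for code, desc in candidates)
    # Step 3: refined prompt asking the model to choose among the retrieved codes
    choice = llm(f"ED note:\n{note}\n\nCandidate codes:\n{menu}\n\nReply with the single best code.").strip()
    # Assumption of this sketch: fall back to the top-ranked candidate on an invalid reply
    valid = {code for code, _ in candidates}
    return choice if choice in valid else candidates[0][0]


def stub_llm(prompt: str) -> str:
    """Toy stand-in for an LLM so the sketch runs without any external service."""
    if "Candidate codes:\n" in prompt:
        first_line = prompt.split("Candidate codes:\n", 1)[1].splitlines()[0]
        return first_line.split(":", 1)[0]
    return "R07.999 Chest pain unspecified"  # deliberately non-existent code


print(code_with_rag("45M with substernal chest pain, normal ECG.", stub_llm, k=3))
```

Here the stub's first guess carries a code that does not exist, and the retrieval step maps its description onto a real database entry, so the call prints `R07.9`. Restricting the final answer to retrieved candidates is one way a pipeline like this can avoid the fictitious codes mentioned under Background and Motivation.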