Abstract:
Background: Accurate medical coding is essential for clinical and administrative purposes, but it is complicated, time-consuming, and prone to bias. This study compares Retrieval-Augmented Generation (RAG)-enhanced LLMs with provider-assigned codes in producing ICD-10-CM codes from emergency department (ED) clinical records.
Methods: This retrospective cohort study used 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated data from 1,038,066 past ED visits (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.
Findings: RAG-enhanced LLMs outperformed provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where GPT-4 and provider-assigned codes differed, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where providers' codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 cases, whereas human coders were preferred in only 181 cases (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also demonstrated substantial scalability when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes improved significantly following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.
Interpretation: RAG-enhanced LLMs improve medical coding accuracy in EDs, suggesting clinical workflow applications. These findings show that generative AI can improve clinical outcomes and reduce administrative burdens.
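
The preference counts reported in the Findings can be sanity-checked with a simple two-sided sign test. The abstract does not say which statistical test the authors used, so the sketch below is an assumption: it treats each recorded preference as an independent trial with a 50/50 chance under the null hypothesis, which may not hold when several reviewers rate the same case. The function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test p-value under H0: each side is preferred with probability 0.5.

    Assumption (not stated in the abstract): preferences are independent trials.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5), kept as an exact integer count.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail / 2 ** n)


if __name__ == "__main__":
    # Counts taken from the abstract: (LLM preferred, provider preferred).
    reported = {
        "GPT-4 accuracy": (447, 277),
        "GPT-4 specificity": (509, 181),
        "Llama-3.1-70B accuracy": (218, 90),
    }
    for label, (llm_wins, provider_wins) in reported.items():
        print(f"{label}: p = {sign_test_p(llm_wins, provider_wins):.3g}")
```

Under these assumptions all three comparisons fall well below 0.001, in line with the reported p<0.001. The sketch uses only the two counts given for each comparison and says nothing about ratings that expressed no preference.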

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper evaluates how well large language models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) generate International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes in the Emergency Department (ED), and compares their performance with manual coding by physicians.

### Background and Motivation

1. **Importance of Medical Coding**: Accurate medical coding is crucial for clinical records, public health surveillance, academic research, and medical reimbursement.
2. **Existing Issues**: ICD coding is time-consuming and prone to errors, leading to incorrect or incomplete documentation.
3. **Automation Attempts**: Efforts to automate the extraction of medical codes using LLMs already exist, but these models often generate inaccurate or fictitious codes when provided only with code descriptions.
4. **Advantages of RAG Technology**: RAG improves the accuracy of LLMs by combining their generative capabilities with a retrieval component that draws on externally verifiable data.

### Research Objectives

1. **Evaluate the Performance of RAG-Enhanced LLMs in ED ICD-10-CM Coding**: The study compares the accuracy and specificity of ICD-10-CM codes generated by RAG-enhanced LLMs with those manually assigned by physicians.
2. **Explore the Potential Applications of RAG Technology**: The study also explores how RAG could improve medical coding accuracy, reduce administrative burden, and enhance clinical documentation.

### Methods

1. **Study Design**: A retrospective cohort study analyzing data from 500 randomly selected ED visits from January to April 2024.
2. **Data Source**: Data were obtained from the Epic electronic health record (EHR) system of the Mount Sinai Health System (MSHS).
3. **RAG Database**: ICD codes and descriptions for all adult ED visits from 2021 to 2023 were collected for training and testing the RAG system.
4. **Experimental Steps**:
   - Initial Code Prediction Prompt: Nine different LLMs predict the primary ICD-10-CM diagnosis code and its description from the clinical notes.
   - Retrieval-Augmented Generation (RAG): The database is queried to identify the 10 most relevant ICD code descriptions.
   - Refined Prompt: The 10 retrieved code descriptions are resubmitted to each LLM together with the ED notes, and the LLM selects the most appropriate description and its associated code.
   - LLM Evaluation: Two large open-source LLMs (Llama-3.1-70B, Qwen-2-72B) independently compare the accuracy and specificity of all LLM-assigned and human-assigned codes.
   - Human Evaluation: Four reviewers (two attending emergency physicians and two emergency medicine residents) assess the accuracy and specificity of GPT-4-generated codes versus codes manually assigned by physicians.

### Results

1. **Accuracy and Specificity**: RAG-enhanced LLMs outperformed manual coding by physicians in both the accuracy and the specificity of code assignment.
2. **Specific Performance**:
   - In 200 cases where GPT-4 and provider-assigned codes differed, human reviewers preferred GPT-4 for accuracy 447 times versus 277 for provider codes, and for specificity 509 times versus 181 (p<0.001 for both).
   - Smaller open-access models also performed well after RAG enhancement; for example, Llama-3.1-70B was preferred for accuracy 218 times versus 90 for providers' codes.

### Discussion

1. **Advantages of RAG Technology**: By retrieving relevant contextual information in real time, RAG improves coding accuracy when clinical language is complex or documentation is ambiguous.
2. **Potential Applications**: Beyond improving coding accuracy and specificity, RAG-enhanced LLMs could reduce administrative burden, enhance hospital operational efficiency, and improve patient outcomes.
3. **Limitations**: The model's performance may degrade over time and require continuous updates; ethical considerations and the risk of "black box" decision-making need attention; and performance may vary across healthcare settings.
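
The experimental steps listed under Methods (initial prediction, retrieval of the 10 most relevant descriptions, refined prompt) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the summary does not say how retrieval is scored, so a simple Jaccard token overlap stands in for the real retriever; `toy_llm` is a deterministic stub, not one of the nine evaluated models; and `CODE_DB` is a four-entry placeholder for the database built from 2021 to 2023 ED visits.

```python
import re
from typing import Callable, List, Set, Tuple

Entry = Tuple[str, str]  # (ICD-10-CM code, description)

# Tiny placeholder for the retrieval database (hypothetical contents).
CODE_DB: List[Entry] = [
    ("R07.9", "Chest pain, unspecified"),
    ("J18.9", "Pneumonia, unspecified organism"),
    ("S93.401A", "Sprain of unspecified ligament of right ankle, initial encounter"),
    ("R10.9", "Unspecified abdominal pain"),
]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, db: List[Entry], k: int = 10) -> List[Entry]:
    """Return up to k entries ranked by Jaccard overlap with the query.

    Assumption: this scoring only stands in for whatever retriever the study used.
    """
    q = tokens(query)

    def score(entry: Entry) -> float:
        d = tokens(entry[1])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(db, key=score, reverse=True)[:k]


def code_note(note: str, db: List[Entry], llm: Callable[[str], str], k: int = 10) -> Entry:
    """Two-prompt coding flow: predict a description, retrieve candidates, then select one."""
    # Step 1: initial prediction from the note alone.
    guess = llm(f"Give the primary ICD-10-CM diagnosis description for this ED note:\n{note}")
    # Step 2: retrieve the k descriptions closest to the initial prediction.
    candidates = retrieve(guess, db, k)
    # Step 3: refined prompt with the note plus numbered candidates.
    listing = "\n".join(f"{i + 1}. {desc}" for i, (_, desc) in enumerate(candidates))
    answer = llm(
        f"ED note:\n{note}\n\nCandidates:\n{listing}\n\n"
        "Reply with the number of the best candidate."
    )
    match = re.search(r"\d+", answer)
    idx = int(match.group()) - 1 if match else 0
    if not 0 <= idx < len(candidates):
        idx = 0  # Fall back to the top-ranked candidate on an unusable reply.
    return candidates[idx]


def toy_llm(prompt: str) -> str:
    """Deterministic stub standing in for a real model call."""
    return "1" if prompt.startswith("ED note:") else "chest pain"


if __name__ == "__main__":
    print(code_note("45-year-old with substernal chest pain for two hours.", CODE_DB, toy_llm))
```

One design point worth noting in this sketch: the returned code is always looked up from the retrieved candidates rather than taken from the model's free text, so the flow cannot emit a code that is absent from the database. Whether the study enforced selection in the same way is not stated in the summary.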

Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

Benchmarking Large Language Models for Extraction of International Classification of Diseases Codes from Clinical Documentation

Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Evaluating Enhanced LLMs for Precise Mental Health Diagnosis from Clinical Notes

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room

Large language models are good medical coders, if provided with tools

Retrieval-augmented large language models for clinical trial screening.

Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates

Surpassing GPT-4 Medical Coding with a Two-Stage Approach

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

Generative AI in Medicine and Healthcare: Moving Beyond the ‘peak of Inflated Expectations’

Enhancing Precision in Detecting Severe Immune-Related Adverse Events: Comparative Analysis of Large Language Models and International Classification of Disease Codes in Patient Records

Evaluating a Natural Language Processing-Driven, AI-Assisted International Classification of Diseases, 10th Revision, Clinical Modification, Coding System for Diagnosis Related Groups in a Real Hospital Environment: Algorithm Development and Validation Study

Evaluating the Efficacy of Large Language Models in CPT Coding for Craniofacial Surgery: A Comparative Analysis

Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review

Identifying low acuity Emergency Department visits with a machine learning approach: The low acuity visit algorithms (LAVA)

The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls

Exploring LLM Multi-Agents for ICD Coding

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models

Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model