Benchmarking Large Language Models for Extraction of International Classification of Diseases Codes from Clinical Documentation

Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher, Jami Pincavitch, Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Smith, Ali Soroush, Robert Freeman, Donald J Apakama, Alexander Charney, Roopa Kohli-Seth, Girish Nadkarni, Ankit Sakhuja
DOI: https://doi.org/10.1101/2024.04.29.24306573
2024-11-23
Abstract: Background: Healthcare reimbursement and coding depend on accurate extraction of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes from clinical documentation. Attempts to automate this task have had limited success. This study aimed to evaluate the performance of large language models (LLMs) in extracting ICD-10-CM codes from unstructured inpatient notes and to benchmark them against a human coder. Methods: This study compared the performance of GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b against a human coder in extracting ICD-10-CM codes from unstructured inpatient notes. We presented deidentified inpatient notes from American Health Information Management Association Vlab authentic patient cases to the LLMs and the human coder for extraction of ICD-10-CM codes. The LLMs received a standard prompt for extracting ICD-10-CM codes. The human coder analyzed the same notes using 3M Encoder, adhering to the 2022 ICD-10-CM Coding Guidelines. Results: We analyzed 50 inpatient notes, comprising 23 history and physicals and 27 progress notes. The human coder identified 165 unique codes, with a median of 4 codes per note. The median number of codes extracted per note varied across LLMs: GPT-3.5: 7, GPT-4: 6, Claude 2.1: 6, Claude 3: 8, Gemini Advanced: 5, and Llama 2-70b: 11. GPT-4 had the best performance, though its agreement with the human coder was poor: 15.2% for overall extraction of ICD-10-CM codes and 26.4% for extraction of ICD-10-CM code categories. Conclusion: Current LLMs perform poorly in extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
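The abstract reports agreement at two levels (full code and category) but does not say how agreement was calculated. As a minimal sketch, assuming agreement means the fraction of human-assigned codes that the model also extracted, and assuming "category" refers to the first three characters of an ICD-10-CM code, the two figures could be computed as follows. The function names and the example codes are hypothetical illustrations, not data or methods from the study.

```python
def normalize(code: str) -> str:
    """Strip whitespace and the optional dot, and upper-case the code."""
    return code.strip().upper().replace(".", "")


def category(code: str) -> str:
    """Return the three-character category, e.g. 'E11.9' -> 'E11'."""
    return normalize(code)[:3]


def agreement(human: set[str], model: set[str], by_category: bool = False) -> float:
    """Fraction of human-assigned codes also produced by the model.

    ASSUMPTION: this definition is illustrative; the abstract does not
    specify the agreement metric the authors used.
    """
    key = category if by_category else normalize
    human_keys = {key(c) for c in human}
    model_keys = {key(c) for c in model}
    if not human_keys:
        return 0.0
    return len(human_keys & model_keys) / len(human_keys)


# Hypothetical codes for a single note (not taken from the study).
human_codes = {"E11.9", "I10", "N17.9"}
model_codes = {"E11.65", "I10", "J18.9", "R50.9"}

exact = agreement(human_codes, model_codes)
by_cat = agreement(human_codes, model_codes, by_category=True)
print(f"exact: {exact:.1%}, category: {by_cat:.1%}")
```

Under this definition, category-level agreement can never be lower than exact agreement, because any two codes that match exactly also share a category. That is consistent with the reported pattern of 26.4% versus 15.2%.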