Can Large Language Models abstract Medical Coded Language?

Simon A. Lee, Timothy Lindsey
2024-06-07
Abstract: Large Language Models (LLMs) have become a pivotal research area, with potential benefits in fields like healthcare, where they could streamline automated billing and decision support. However, the frequent use of specialized coded languages like ICD-10, which are regularly updated and deviate from natural language formats, presents potential challenges for LLMs in creating accurate and meaningful latent representations. This raises concerns among healthcare professionals about potential inaccuracies or "hallucinations" that could directly affect patients. Therefore, this study evaluates whether LLMs are aware of medical code ontologies and can accurately generate names from these codes. We assess the capabilities and limitations of both general and biomedical-specific generative models, such as GPT, LLaMA-2, and Meditron, focusing on their proficiency with domain-specific terminologies. While the results indicate that LLMs struggle with coded language, we offer insights on how to adapt these models to reason more effectively.
Computation and Language
What problem does this paper attempt to address?
The paper primarily explores the capabilities and limitations of large language models (LLMs) in handling medical coding language. Specifically, the researchers evaluated whether these models can understand medical coding ontologies and accurately generate corresponding names. The models used in the study include general models and biomedical-specific generative models, such as GPT, LLaMA-2, and Meditron.

The core questions of the paper are:
- Can large language models accurately generate the correct labels from medical codes?
- Do these models exhibit "hallucinations" or errors when dealing with standardized medical coding (such as ICD-10)?

The study evaluates the models' performance through a series of experiments, including predicting medical chapter names, generating medical code names, and adversarial attack experiments. The results indicate that although some models (like GPT-4) perform relatively well, large language models generally face significant difficulties in handling medical coding, particularly in distinguishing between real and fake codes. Additionally, the study found that the models perform better with common codes than with rare ones.

The paper concludes by proposing several potential solutions, including the use of knowledge graphs, enhancing reasoning capabilities, and generating synthetic data, to improve the performance of large language models in handling medical coding.
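To make the evaluation described above more concrete, here is a minimal Python sketch of how code-to-name generation and real-versus-fake code detection might be scored. It is not the authors' implementation: the `REFERENCE` table is a three-entry toy subset, `stub_model` is a hypothetical stand-in for an LLM call, `ZZZ.999` is a deliberately malformed code used as the fake example, and exact string matching after normalization is an assumed metric, since the summary does not state which one the paper uses.

```python
from typing import Callable, Dict, Iterable, Optional

# Toy reference table (assumption: a real study would load a full ICD-10 release).
REFERENCE: Dict[str, str] = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def score(model: Callable[[str], Optional[str]], codes: Iterable[str]) -> Dict[str, float]:
    """Score a model that maps a code to a name, or to None if it rejects the code.

    Codes present in REFERENCE count as real; all others count as fake.
    """
    n_real = n_fake = correct_real = rejected_fake = 0
    for code in codes:
        answer = model(code)
        if code in REFERENCE:
            n_real += 1
            if answer is not None and normalize(answer) == normalize(REFERENCE[code]):
                correct_real += 1
        else:
            n_fake += 1
            if answer is None:  # the model correctly refused to name a fake code
                rejected_fake += 1
    return {
        "n_real": n_real,
        "n_fake": n_fake,
        "real_accuracy": correct_real / n_real if n_real else 0.0,
        "fake_rejection": rejected_fake / n_fake if n_fake else 0.0,
    }


def stub_model(code: str) -> Optional[str]:
    """Hypothetical model: right on two codes, wrong on one, and it never refuses."""
    canned = {
        "E11.9": "Type 2 diabetes mellitus without complications",
        "I10": "essential (primary) hypertension",
        "J45.909": "Acute bronchitis",  # wrong on purpose
    }
    # Any unknown code gets a confident but invented name (a "hallucination").
    return canned.get(code, "Unspecified disorder")


if __name__ == "__main__":
    print(score(stub_model, ["E11.9", "I10", "J45.909", "ZZZ.999"]))
```

Reporting accuracy on real codes separately from the rejection rate on fake codes keeps the two failure modes apart: a model can name common codes correctly and still invent a description for a code that does not exist.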