Biomedical Text Normalization through Generative Modeling
Jacob S. Berkowitz, Yasaman Fatapour, Jose Miguel Acitores Cortina, Apoorva Srinivasan, Nicholas P Tatonetti
DOI: https://doi.org/10.1101/2024.09.30.24314663
2024-10-01
Abstract:
Objective: Around 80% of electronic health record (EHR) data consists of unstructured medical text. This text is flexible and inconsistent by nature, which makes it difficult to use for clinical trial matching, decision support, and predictive modeling. In this study, we develop and assess text normalization pipelines built on large language models (LLMs).
Materials and Methods: We evaluated four LLM-based normalization strategies (Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation (RAG)) and one baseline, TF-IDF-based String Matching. We compared normalization performance across two datasets of condition terms mapped to SNOMED: one tailored to oncology and one covering a wide range of medical conditions. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalize terms to Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms.
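To make the baseline concrete, the following is a minimal sketch of TF-IDF string matching: each vocabulary term is turned into a TF-IDF vector over character trigrams, and a free-text mention is mapped to the term with the highest cosine similarity. The abstract does not specify the tokenization, weighting, or library used in the study, so the trigram features, the smoothed IDF formula, the function names, and the three example terms are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lowercased, space-padded string into character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length; leave empty vectors as-is."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(vocabulary, n=3):
    """Compute smoothed IDF weights and unit-length TF-IDF vectors per term."""
    docs = [Counter(char_ngrams(term, n)) for term in vocabulary]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    num_docs = len(vocabulary)
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1, so no weight is zero.
    idf = {g: math.log((1 + num_docs) / (1 + df)) + 1.0
           for g, df in doc_freq.items()}
    vectors = [l2_normalize({g: tf * idf[g] for g, tf in doc.items()})
               for doc in docs]
    return idf, vectors


def match(query, vocabulary, idf, vectors, n=3):
    """Return the vocabulary term most similar to the query, with its score."""
    counts = Counter(char_ngrams(query, n))
    # N-grams never seen in the vocabulary carry no weight and are dropped.
    q = l2_normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    scores = [sum(w * vec.get(g, 0.0) for g, w in q.items())
              for vec in vectors]
    best = max(range(len(vocabulary)), key=scores.__getitem__)
    return vocabulary[best], scores[best]


# Illustrative terms only; not drawn from the study's datasets.
vocabulary = [
    "Malignant neoplasm of breast",
    "Type 2 diabetes mellitus",
    "Essential hypertension",
]
idf, vectors = build_index(vocabulary)
term, score = match("diabetes mellitus type 2", vocabulary, idf, vectors)
print(term, round(score, 3))
```

A matcher of this kind rewards surface overlap, so it handles reordered or slightly misspelled mentions but cannot link synonyms that share no characters, which is the gap the LLM-based strategies are meant to address.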
Results: RAG, which combines Prompt Recall and Semantic Search, performed best, identifying the correct term 88.31% of the time on the domain-specific dataset and 79.97% on the broader dataset. Our model achieved a micro F1 score of 88.01 on Task 4 of TAC 2017, surpassing all other models without relying on the provided training data.
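For readers unfamiliar with the metric, micro-averaged F1 pools true positives, false positives, and false negatives across all documents before computing precision and recall. The sketch below uses the standard definition; the official TAC 2017 scorer may differ in details such as matching rules, and the function name, the dictionary layout, and the two toy drug labels are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall, and F1 over per-document term sets.

    gold, predicted: dicts mapping a document id to a set of normalized terms.
    Counts are pooled over all documents before the ratios are computed.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_terms = gold.get(doc_id, set())
        pred_terms = predicted.get(doc_id, set())
        tp += len(gold_terms & pred_terms)   # correctly predicted terms
        fp += len(pred_terms - gold_terms)   # predicted but not in gold
        fn += len(gold_terms - pred_terms)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up labels; not data from the study.
gold = {"label_a": {"Nausea", "Headache"}, "label_b": {"Rash"}}
predicted = {"label_a": {"Nausea", "Dizziness"}, "label_b": {"Rash"}}
precision, recall, f1 = micro_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The function returns fractions between 0 and 1; multiplying by 100 gives the percentage scale on which scores such as 88.01 are usually reported.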
Discussion: These findings demonstrate the potential of LLMs in medical text normalization. We find that retrieval-focused approaches overcome traditional LLM limitations for this task.
Conclusion: Large language models combined with retrieval-augmented generation should be explored for text normalization of biomedical free text.
Health Informatics