Enhancing Clinical Data Extraction from Pathology Reports: A Comparative Analysis of Large Language Models

Sunghyeon Park, Wona Choi, InYoung Choi
DOI: https://doi.org/10.3233/SHTI240523
2024-08-22
Abstract: This study evaluates the efficacy of a small large language model (sLLM) in extracting critical information from free-text pathology reports across multiple centers, addressing the challenges posed by the narrative and complex nature of these documents. Employing three variants of the Llama 2 model, with 7 billion, 13 billion, and 70 billion parameters, the research assesses model performance in both zero-shot and five-shot settings, offering insights into the impact of example-based learning. A specialized information extraction tool utilizing regular expressions for pattern identification serves as the benchmark for evaluating the models' accuracy. Conducted within a hospital's internal environment, the study emphasizes the clinical applicability of these findings. The results reveal significant variations in model performance, with the 70 billion parameter model achieving remarkable accuracy in the five-shot scenario, demonstrating the potential of sLLMs in enhancing the efficiency and accuracy of data extraction from pathology reports. The study highlights the importance of example-driven learning and the trade-offs between model size, accuracy, hallucination rates, and processing time. These findings contribute to the ongoing efforts to integrate advanced language models into clinical settings, potentially transforming patient care and biomedical research by mitigating the limitations of manual data extraction processes.
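The evaluation setup described in the abstract (a regular-expression extractor as the benchmark, with model prompts built from zero or five worked examples) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not give the extracted fields, the regular expressions, or the prompt wording, so the `TUMOR_SIZE` pattern, the prompt text, and the function names `regex_extract`, `build_prompt`, and `accuracy` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: the study's actual regular expressions are not given.
TUMOR_SIZE = re.compile(
    r"tumou?r size\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(cm|mm)", re.IGNORECASE
)


def regex_extract(report: str) -> Optional[str]:
    """Rule-based benchmark: return a normalized tumor size, or None if absent."""
    match = TUMOR_SIZE.search(report)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2).lower()}"


def build_prompt(report: str, examples: List[Tuple[str, str]]) -> str:
    """Build an extraction prompt.

    An empty `examples` list gives a zero-shot prompt; five
    (report, answer) pairs give a five-shot prompt.
    """
    parts = ["Extract the tumor size from the pathology report."]
    for example_report, example_answer in examples:
        parts.append(f"Report: {example_report}\nTumor size: {example_answer}")
    parts.append(f"Report: {report}\nTumor size:")
    return "\n\n".join(parts)


def accuracy(predictions: List[Optional[str]], references: List[Optional[str]]) -> float:
    """Fraction of model outputs that exactly match the benchmark values."""
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

In this sketch the model call itself is left out: the string returned by `build_prompt` would be sent to whichever Llama 2 variant is under test, and the model's answers would be scored against the `regex_extract` output with `accuracy`. Exact string matching is the simplest possible scoring rule and is used here only as a placeholder.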