Human-level information extraction from clinical reports with fine-tuned language models
Longchao Liu,Long Lian,Yiyan Hao,Aidan Pace,Elaine Kim,Nour Homsi,Yash Pershad,Liheng Lai,Thomas Gracie,Ashwin Kishtagari,Peter Carroll,Alexander G Bick,Anobel Y Odisho,Maggie Chung,Adam Yala
DOI: https://doi.org/10.1101/2024.11.18.24317466
2024-11-20
Abstract:Background: Extracting structured data from clinical notes is a key bottleneck in developing AI tools for radiology and pathology. Manual annotation is labor-intensive and unscalable. An efficient, automated method for clinical information extraction with human-level performance is urgently needed.
Purpose: To introduce a low-code open-source library, Strata, that facilitates finetuning, evaluating, and deploying large language models (LLMs) for structured data extraction from clinical reports.
Materials and Methods: Four clinical datasets were annotated by a trained human annotator: 431 prostate MRI reports, 978 breast pathology reports, 238 kidney pathology reports, and 724 myelodysplastic syndrome (MDS) pathology reports.
Datasets were split into training, development, and test sets. A second reader annotated the test set. We fine-tuned open-source LLMs (Llama-3.1 8B, Mistral-v0.3 7B) to extract variables from clinical reports. We evaluated zero-shot and fine-tuned open-source models, zero-shot GPT-4, and the second human annotator using exact
match accuracy. Exact match accuracy assesses if all variables for a report were extracted correctly.
Results: The second human annotator obtained exact match accuracies of 83.1 (95% CI 75.4, 88.5), 75.5 (95% CI 70.4, 80.3), 70.8 (95% CI 59.7, 80.6), and 85.8 (95% CI 80.7, 89.9) in the prostate, breast, kidney, and MDS test sets, respectively. Our finetuned Llama-3.1 8B model achieved human-level performance across all fine-tuning settings with exact-match test-set accuracies of 91.5 (95% CI 85.4, 95.4), 91.5 (95% CI 88.1, 94.2), 72.2 (95% CI 61.1, 81.9), and 89.4 (95% CI 84.9, 93.1) for prostate, breast,
kidney, and MDS reports, respectively. We found that ≤ 100 training reports were
needed to achieve human-level performance on these tasks.
Conclusion: Strata enables automated human-level performance in extracting structured data from clinical notes using ≤ 100 training reports.