The problem of responses less than the reporting limit in unsupervised pattern recognition.

R. Aruga

DOI: https://doi.org/10.1016/j.talanta.2003.10.036

2004-04-19


Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing

Ruben Geevarghese,Carlie Sigel,John Cadley,Subrata Chatterjee,Pulkit Jain,Alex Hollingsworth,Avijit Chatterjee,Nathaniel Swinburne,Khawaja Hasan Bilal,Brett Marinelli

DOI: https://doi.org/10.1136/jcp-2024-209669

2024-09-20

Journal of Clinical Pathology

Abstract:Aims Structured reporting in pathology is not universally adopted and extracting elements essential to research often requires expensive and time-intensive manual curation. The accuracy and feasibility of using large language models (LLMs) to extract essential pathology elements for cancer research is examined here. Methods Retrospective study of patients who underwent pathology sampling for suspected hepatocellular carcinoma and underwent Yttrium-90 embolisation. Five pathology report elements of interest were included for evaluation. LLMs (Generative Pre-trained Transformer (GPT) 3.5 turbo and GPT-4) were used to extract elements of interest. For comparison, a rules-based, regular expressions (REGEX) approach was devised for extraction. Accuracy for each approach was calculated. Results 88 pathology reports were identified. LLMs and REGEX were both able to extract research elements with high accuracy (average 84.1%–94.8%). Conclusions LLMs have significant potential to simplify the extraction of research elements from pathology reporting, and therefore, accelerate the pace of cancer research.

pathology
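
The rules-based baseline described in the abstract above can be illustrated with a minimal sketch. The pattern, the example report snippets, and the function names `extract_grade` and `accuracy` are hypothetical and are not taken from the study; they only show how a regular-expression extractor might be scored against manual labels.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: a tumour grade written as "grade 2" or "Grade: III".
GRADE_PATTERN = re.compile(r"\bgrade\s*:?\s*(\d|I{1,3}|IV)\b", re.IGNORECASE)
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4}


def extract_grade(report: str) -> Optional[int]:
    """Return the first grade found in the report, or None if absent."""
    match = GRADE_PATTERN.search(report)
    if match is None:
        return None
    token = match.group(1).upper()
    return int(token) if token.isdigit() else ROMAN[token]


def accuracy(predictions: Sequence, labels: Sequence) -> float:
    """Fraction of predictions that equal the manually curated label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Invented report snippets with invented labels, for illustration only.
    reports = [
        "Hepatocellular carcinoma, grade 2.",
        "Grade: III, moderately to poorly differentiated.",
        "No tumour grade stated.",
        "Well differentiated (G1) tumour.",  # phrasing the pattern does not cover
    ]
    labels = [2, 3, None, 1]
    predictions = [extract_grade(r) for r in reports]
    print(predictions, accuracy(predictions, labels))
```

The last snippet uses a phrasing the pattern was not written for, which is the kind of brittleness that motivates comparing rules against an LLM.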
Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Mohamed Sobhi Jabal,Pranav Warman,Jikai Zhang,Kartikeye Gupta,Ayush Jain,Maciej Mazurowski,Walter Wiggins,Kirti Magudia,Evan Calabrese

2024-09-18

Abstract:Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top model was a medical fine-tuned llama3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.

Computation and Language,Information Retrieval,Machine Learning
Large language models for extracting histopathologic diagnoses from electronic health records

Brian Johnson,Tyler Bath,Xinyi Huang,Mark Lamm,Ashley Earles,Hyrum Eddington,Lily J. Jih,Samir Gupta,Shailja C. Shah,Kit Curtius

DOI: https://doi.org/10.1101/2024.11.27.24318083

2024-11-28

Abstract:Background & Aims Accurate data resources are essential for impactful medical research. To date, most large-scale studies have relied on structured sources, such as International Classification of Diseases codes, to determine patient diagnoses and outcomes. However, these structured datasets are often incomplete or inaccurate. Recent advances in natural language processing, specifically the introduction of open-weight large language models (LLMs), enable more accurate data extraction from unstructured text in electronic health records (EHRs). Methods We created an approach using LLMs for identifying histopathologic diagnoses, including presence of dysplasia and cancer, in pathology reports from the Department of Veterans Affairs Healthcare System, including those patients with genotype data within the Million Veteran Program (MVP) biobank. Our approach requires no additional training and utilizes a simple 'yes/no' question prompt to obtain an answer. We validated the method on 3 diagnostic tasks by applying the same prompts to reports from patients with vs without diagnoses of inflammatory bowel disease (IBD) and calculating F1-scores as a balanced accuracy measure. Results In patients without IBD in MVP, we achieved F1-scores of 99.3% for identifying any dysplasia, 98.2% for identifying high-grade dysplasia and/or colorectal adenocarcinoma (HGD/CRC), and 96.2% for identifying CRC using LLM Gemma-2. In IBD patients in MVP, we achieved F1-scores of 97.1% for identifying dysplasia, 96.4% for identifying HGD/CRC, and 97.1% for identifying CRC. Conclusion LLMs provide excellent accuracy in extracting diagnoses from EHRs and can be applied to a variety of tasks with no additional human-led development required. Our validated methods generalized to unstructured pathology notes, even withstanding challenges of resource-limited computing environments.

Health Informatics
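
As a rough illustration of the validation step described in the abstract above, free-text yes/no answers can be mapped to booleans and scored with an F1-score. This is a sketch under stated assumptions: the helper names `parse_yes_no` and `f1_score` and the example answers are hypothetical, and the study's own parsing rules are not given in the abstract.

```python
from typing import Sequence


def parse_yes_no(answer: str) -> bool:
    """Map a free-text model answer to True ("yes") or False ("no").

    Assumes the answer begins with the word "yes" or "no"; anything else
    is rejected so that ambiguous answers are not silently miscounted.
    """
    words = answer.strip().lower().split()
    first = words[0].strip(".,:;!") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    raise ValueError(f"cannot interpret answer: {answer!r}")


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented model answers and reference labels, for illustration only.
    answers = ["Yes.", "No, none seen.", "yes", "No"]
    reference = [True, False, False, False]
    predicted = [parse_yes_no(a) for a in answers]
    print(f1_score(predicted, reference))
```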
Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4)

Daniel Truhn,Chiara Ml Loeffler,Gustav Müller-Franzes,Sven Nebelung,Katherine J Hewitt,Sebastian Brandner,Keno K Bressem,Sebastian Foersch,Jakob Nikolas Kather

DOI: https://doi.org/10.1002/path.6232

Abstract:Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer

Rajesh Bhayana,Bipin Nanda,Taher Dehkharghanian,Yangqing Deng,Nishaant Bhambra,Gavin Elias,Daksh Datta,Avinash Kambadakone,Chaya G. Shwaartz,Carol-Anne Moulton,David Henault,Steven Gallinger,Satheesh Krishna

DOI: https://doi.org/10.1148/radiol.233117

IF: 19.7

2024-06-20

Radiology

Abstract:Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently. Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability. Materials and Methods In this institutional review board-approved...

radiology, nuclear medicine & medical imaging
Privacy-preserving large language models for structured medical information retrieval

Isabella Catharina Wiest,Dyke Ferber,Jiefu Zhu,Marko van Treeck,Sonja K. Meyer,Radhika Juglan,Zunamys I. Carrero,Daniel Paech,Jens Kleesiek,Matthias P. Ebert,Daniel Truhn,Jakob Nikolas Kather

DOI: https://doi.org/10.1038/s41746-024-01233-2

IF: 15.2

2024-09-21

npj Digital Medicine

Abstract:Most clinical information is encoded as free text, not accessible for quantitative analysis. This study presents an open-source pipeline using the local large language model (LLM) "Llama 2" to extract quantitative information from clinical text and evaluates its performance in identifying features of decompensated liver cirrhosis. The LLM identified five key clinical features in a zero- and one-shot manner from 500 patient medical histories in the MIMIC IV dataset. We compared LLMs of three sizes and various prompt engineering approaches, with predictions compared against ground truth from three blinded medical experts. Our pipeline achieved high accuracy, detecting liver cirrhosis with 100% sensitivity and 96% specificity. High sensitivities and specificities were also yielded for detecting ascites (95%, 95%), confusion (76%, 94%), abdominal pain (84%, 97%), and shortness of breath (87%, 97%) using the 70 billion parameter model, which outperformed smaller versions. Our study successfully demonstrates the capability of locally deployed LLMs to extract clinical information from free text with low hardware requirements.

health care sciences & services,medical informatics
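
The sensitivity and specificity figures quoted in the abstract above follow from a two-by-two comparison of model output against expert ground truth. The sketch below shows that arithmetic only; the function name `sensitivity_specificity` and the example vectors are hypothetical and are not part of the published pipeline.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one binary clinical feature.

    sensitivity = TP / (TP + FN): share of true cases that were detected.
    specificity = TN / (TN + FP): share of non-cases correctly left unflagged.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Invented predictions for a single feature across five patient histories.
    model_says = [True, True, False, False, False]
    experts_say = [True, False, True, False, False]
    print(sensitivity_specificity(model_says, experts_say))
```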
CORAL: Expert-Curated medical Oncology Reports to Advance Language Model Inference

Madhumita Sushil,Vanessa E. Kennedy,Divneet Mandair,Brenda Y. Miao,Travis Zack,Atul J. Butte

DOI: https://doi.org/10.1056/AIdbp2300110

2024-01-12

Abstract:Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented in clinical notes. Despite their vital role, no current oncology information representation and annotation schema fully encapsulates the diversity of information recorded within these notes. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, due to the current lack of comprehensively annotated oncology datasets, an extensive evaluation of LLMs in extracting and reasoning with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 40 de-identified breast and pancreatic cancer progress notes at University of California, San Francisco, we applied this schema to assess the zero-shot abilities of three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to extract detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an exact-match F1-score of 0.51, and an average accuracy of 68% on complex tasks (expert manual evaluation on subset). Notably, it was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in relational inference like adverse event detection. However, further improvements are needed before using it to reliably extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.

Computation and Language
Development of a privacy preserving large language model for automated data extraction from thyroid cancer pathology reports

Denise T Lee,Akhil Vaid,Kartikeya M Menon,Robert Freeman,David S Matteson,Michael P Marin,Girish N Nadkarni

DOI: https://doi.org/10.1101/2023.11.08.23298252

2023-11-09

MedRxiv

Abstract:Introduction: Popularized by ChatGPT, large language models (LLM) are poised to transform the scalability of clinical natural language processing (NLP) downstream tasks such as medical question answering (MQA) and may enhance the ability to rapidly and accurately extract key information from clinical narrative reports. However, the use of LLMs in the healthcare setting is limited by cost, computing power and concern for patient privacy. In this study we evaluate the extraction performance of a privacy preserving LLM for automated MQA from surgical pathology reports. Study Design: 84 thyroid cancer surgical pathology reports were assessed by two independent reviewers and the open-source FastChat-T5 3B-parameter LLM using institutional computing resources. Longer text reports were converted to embeddings. 12 medical questions for staging and recurrence risk data extraction were formulated and answered for each report. Time to respond and concordance of answers were evaluated. Results: Out of a total of 1008 questions answered, reviewers 1 and 2 had an average concordance rate of responses of 99.1% (SD: 1.0%). The LLM was concordant with reviewers 1 and 2 at an overall average rate of 88.86% (SD: 7.02%) and 89.56% (SD: 7.20%). The overall time to review and answer questions for all reports was 206.9, 124.04 and 19.56 minutes for Reviewers 1, 2 and LLM, respectively. Conclusion: A privacy preserving LLM may be used for MQA with considerable time-saving and an acceptable accuracy in responses. Prompt engineering and fine tuning may further augment automated data extraction from clinical narratives for the provision of real-time, essential clinical insights.
From Text to Tables: A Local Privacy Preserving Large Language Model for Structured Information Retrieval from Medical Documents

Isabella Catharina Wiest,Dyke Ferber,Jiefu Zhu,Marko Van Treeck,Sonja Katharina Meyer,Radhika Juglan,Zunamys I. Carrero,Daniel Paech,Jens Kleesiek,Matthias P. Ebert,Daniel Truhn,Jakob Nikolas Kather

DOI: https://doi.org/10.1101/2023.12.07.23299648

2023-12-10

MedRxiv

Abstract:Background and Aims Most clinical information is encoded as text, but extracting quantitative information from text is challenging. Large Language Models (LLMs) have emerged as powerful tools for natural language processing and can parse clinical text. However, many LLMs including ChatGPT reside in remote data centers, which disqualifies them from processing personal healthcare data. We present an open-source pipeline using the local LLM 'Llama 2' for extracting quantitative information from clinical text and evaluate its use to detect clinical features of decompensated liver cirrhosis. Methods We tasked the LLM to identify five key clinical features of decompensated liver cirrhosis in a zero- and one-shot way without any model training. Our specific objective was to identify abdominal pain, shortness of breath, confusion, liver cirrhosis, and ascites from 500 patient medical histories from the MIMIC IV dataset. We compared LLMs with three different sizes and a variety of pre-specified prompt engineering approaches. Model predictions were compared against the ground truth provided by the consent of three blinded medical experts. Results Our open-source pipeline yielded highly accurate extraction of quantitative features from medical free text. Clinical features which were explicitly mentioned in the source text, such as liver cirrhosis and ascites, were detected with a sensitivity of 100% and 95% and a specificity of 96% and 95%, respectively, from the 70 billion parameter model. Other clinical features, which are often paraphrased in a variety of ways, such as the presence of confusion, were detected only with a sensitivity of 76% and a specificity of 94%. Abdominal pain was detected with a sensitivity of 84% and a specificity of 97%. Shortness of breath was detected with a sensitivity of 87% and a specificity of 96%. The larger version of Llama 2 with 70b parameters outperformed the smaller version with 7b parameters in all tasks. Prompt engineering improved zero-shot performance, particularly for smaller model sizes. Conclusion Our study successfully demonstrates the capability of using locally deployed LLMs to extract clinical information from free text. The hardware requirements are so low that not only on-premise, but also point-of-care deployment of LLMs are possible.
Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Zhongheng Zhang,Hongying Ni

DOI: https://doi.org/10.1016/j.jointm.2024.09.002

2024-01-01

Journal of Intensive Medicine

Abstract:The integration of large language models (LLMs) in clinical medicine, particularly in critical care, has introduced transformative capabilities for analyzing and managing complex medical information. This technical note explores the application of LLMs, such as generative pretrained transformer 4 (GPT-4) and Qwen-Chat, in interpreting electronic healthcare records to assist with rapid patient condition assessments, predict sepsis, and automate the generation of discharge summaries. The note emphasizes the significance of LLMs in processing unstructured data from electronic health records (EHRs), extracting meaningful insights, and supporting personalized medicine through nuanced understanding of patient histories. Despite the technical complexity of deploying LLMs in clinical settings, this document provides a comprehensive guide to facilitate the effective integration of LLMs into clinical workflows, focusing on the use of DashScope's application programming interface (API) services for judgment on patient prognosis and organ support recommendations based on natural language in EHRs. By illustrating practical steps and best practices, this work aims to lower the technical barriers for clinicians and researchers, enabling broader adoption of LLMs in clinical research and practice to enhance patient care and outcomes.
Large Multimodal Model based Standardisation of Pathology Reports with Confidence and their Prognostic Significance

Ethar Alzaid,Gabriele Pergola,Harriet Evans,David Snead,Fayyaz Minhas

2024-05-03

Abstract:Pathology reports are rich in clinical and pathological details but are often presented in free-text format. The unstructured nature of these reports presents a significant challenge limiting the accessibility of their content. In this work, we present a practical approach based on the use of large multimodal models (LMMs) for automatically extracting information from scanned images of pathology reports with the goal of generating a standardised report specifying the value of different fields along with estimated confidence about the accuracy of the extracted fields. The proposed approach overcomes limitations of existing methods which do not assign confidence scores to extracted fields limiting their practical use. The proposed framework uses two stages of prompting a Large Multimodal Model (LMM) for information extraction and validation. The framework generalises to textual reports from multiple medical centres as well as scanned images of legacy pathology reports. We show that the estimated confidence is an effective indicator of the accuracy of the extracted information that can be used to select only accurately extracted fields. We also show the prognostic significance of structured and unstructured data from pathology reports and show that the automatically extracted field values have significant prognostic value for patient stratification. The framework is available for evaluation via the URL:

Computation and Language
A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

Madhumita Sushil,Travis Zack,Divneet Mandair,Zhiwei Zheng,Ahmed Wali,Yan-Ning Yu,Yuwei Quan,Dmytro Lituiev,Atul J Butte

DOI: https://doi.org/10.1093/jamia/ocae146

2024-10-01

Abstract:Objective: Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations. Materials and methods: We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare zero-shot classification capability of the following LLMs: GPT-4, GPT-3.5, Starling, and ClinicalCamel, with task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. Results: Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with advantage on tasks with high label imbalance. Other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, and complex task design, and several LSTM-Att errors were related to poor generalization to the test set. Discussion: On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, if the use of LLMs is prohibitive, the use of simpler models with large annotated datasets can provide comparable results. Conclusions: GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.
Transformative potential of Large Language Models in data mining on Electronic Health Records.

Amadeo Jesus Wals Zurita Sr.,Hector Miras del Rio Sr.,Nerea Ugarte Ruiz de Aguirre,Cristina Nebrera Navarro,Maria Rubio Jimenez,David Munoz Carmona,Carlos Miguez Sanchez

DOI: https://doi.org/10.1101/2024.03.07.24303588

2024-10-14

Abstract:Introduction: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of Large Language Models (LLMs) in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators. Methods: We implemented a script using the OpenAI API to extract structured information in JSON format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by five specialists in radiation oncology. We compared the results using metrics such as Sensitivity, Specificity, Precision, Accuracy, F-value, Kappa index, and the McNemar test, in addition to examining the common causes of errors in both humans and GPT models. Results: The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar's test p = 0.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar's test p < 0.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs. 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred non-explicit comorbidities, sometimes correctly, though this also resulted in more false positives. Conclusion: This study demonstrates that, with well-designed prompts, the LLMs examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.

Health Informatics
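
The abstract above compares paired raters (physicians and GPT models judging the same items) with the McNemar test. A minimal sketch of the exact two-sided form of that test follows; the abstract does not say which variant the authors used, so the exact binomial version, the function name `mcnemar_exact`, and the example data are all assumptions made for illustration.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two raters scored on the same items.

    Only discordant pairs matter: items where exactly one rater was correct.
    Under the null hypothesis each discordant pair favours either rater with
    probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    a_only = sum(a and not b for a, b in zip(a_correct, b_correct))
    b_only = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Invented per-item correctness flags for a model and a human reviewer.
    model = [True, True, True, True, True, True, False]
    human = [False, False, False, False, False, True, False]
    print(mcnemar_exact(model, human))
```

Because the test ignores items both raters got right or both got wrong, two raters with similar overall accuracy can still differ significantly if their errors fall on different items.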
Assessing Large Language Models for Oncology Data Inference from Radiology Reports

Li-Ching Chen,Travis Zack,Arda Demirci,Madhumita Sushil,Brenda Miao,Corynn Kasap,Atul J Butte,Eric Collisson,Julian Hong

DOI: https://doi.org/10.1101/2024.05.23.24307579

2024-05-23

Abstract:Key Points Purpose: We examined the effectiveness of proprietary and open Large Language Models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports. Methods: We analyzed 203 deidentified radiology reports, manually annotated for disease status, location, and indeterminate nodules needing follow-up. Utilizing GPT-4, GPT-3.5-turbo, and open models like Gemma-7B and Llama3-8B, we employed strategies such as ablation and prompt engineering to boost accuracy. Discrepancies between human and model interpretations were reviewed by a secondary oncologist. Results: Among 164 pancreatic adenocarcinoma patients, GPT-4 showed the highest accuracy in inferring disease status, achieving a 75.5% correctness (F1-micro). Open models Mistral-7B and Llama3-8B performed comparably, with accuracies of 68.6% and 61.4%, respectively. Mistral-7B excelled in deriving correct inferences from "Objective Findings" directly. Most tested models demonstrated proficiency in identifying disease-containing anatomical locations from a list of choices, with GPT-4 and Llama3-8B showing near parity in precision and recall for disease site identification. However, open models struggled with differentiating benign from malignant post-surgical changes, impacting their precision in identifying findings indeterminate for cancer. A secondary review occasionally favored GPT-3.5's interpretations, indicating the variability in human judgment. Conclusion: LLMs, especially GPT-4, are proficient in deriving oncological insights from radiology reports. Their performance is enhanced by effective summarization strategies, demonstrating their potential in clinical support and healthcare analytics. This study also underscores the possibility of zero-shot open model utility in environments where proprietary models are restricted. Finally, by providing a set of annotated radiology reports, this paper presents a valuable dataset for further LLM research in oncology.
Use of Natural Language Processing to Infer Sites of Metastatic Disease From Radiology Reports at Scale

See Boon Tay,Guat Hwa Low,Gillian Jing En Wong,Han Jieh Tey,Fun Loon Leong,Constance Li,Melvin Lee Kiang Chua,Daniel Shao Weng Tan,Choon Hua Thng,Iain Bee Huat Tan,Ryan Shea Ying Cong Tan

DOI: https://doi.org/10.1200/CCI.23.00122

Abstract:Purpose: To evaluate natural language processing (NLP) methods to infer metastatic sites from radiology reports. Methods: A set of 4,522 computed tomography (CT) reports of 550 patients with 14 types of cancer was used to fine-tune four clinical large language models (LLMs) for multilabel classification of metastatic sites. We also developed an NLP information extraction (IE) system (on the basis of named entity recognition, assertion status detection, and relation extraction) for comparison. Model performances were measured by F1 scores on test and three external validation sets. The best model was used to facilitate analysis of metastatic frequencies in a cohort study of 6,555 patients with 53,838 CT reports. Results: The RadBERT, BioBERT, GatorTron-base, and GatorTron-medium LLMs achieved F1 scores of 0.84, 0.87, 0.89, and 0.91, respectively, on the test set. The IE system performed best, achieving an F1 score of 0.93. F1 scores of the IE system by individual cancer type ranged from 0.89 to 0.96. The IE system attained F1 scores of 0.89, 0.83, and 0.81 on external validation sets including additional cancer types, positron emission tomography-CT, and magnetic resonance imaging scans, respectively. In our cohort study, we found that for colorectal cancer, liver-only metastases were higher in de novo stage IV versus recurrent patients (29.7% v 12.2%; P < .001). Conversely, lung-only metastases were more frequent in recurrent versus de novo stage IV patients (17.2% v 7.3%; P < .001). Conclusion: We developed an IE system that accurately infers metastatic sites in multiple primary cancers from radiology reports. It has explainable methods and performs better than some clinical LLMs. The inferred metastatic phenotypes could enhance cancer research databases and clinical trial matching, and identify potential patients for oligometastatic interventions.
A survey analysis of the adoption of large language models among pathologists

Thiyaphat Laohawetwanit,Daniel Gomes Pinto,Andrey Bychkov

DOI: https://doi.org/10.1093/ajcp/aqae093

2024-07-27

Abstract:Objectives: We sought to investigate the adoption and perception of large language model (LLM) applications among pathologists. Methods: A cross-sectional survey was conducted, gathering data from pathologists on their usage and views concerning LLM tools. The survey, distributed globally through various digital platforms, included quantitative and qualitative questions. Patterns in the respondents' adoption and perspectives on these artificial intelligence tools were analyzed. Results: Of 215 respondents, 100 (46.5%) reported using LLMs, particularly ChatGPT (OpenAI), for professional purposes, predominantly for information retrieval, proofreading, academic writing, and drafting pathology reports, highlighting a significant time-saving benefit. Academic pathologists demonstrated a better level of understanding of LLMs than their peers. Although chatbots sometimes provided incorrect general domain information, they were considered moderately proficient concerning pathology-specific knowledge. The technology was mainly used for drafting educational materials and programming tasks. The most sought-after feature in LLMs was their image analysis capabilities. Participants expressed concerns about information accuracy, privacy, and the need for regulatory approval. Conclusions: Large language model applications are gaining notable acceptance among pathologists, with nearly half of respondents indicating adoption less than a year after the tools' introduction to the market. They see the benefits but are also worried about these tools' reliability, ethical implications, and security.
Abstract 4966: Machine learning and large language model approach to pancancer data elements

Andrew Niederhausern,Nadia S. Bahadur,Gary Wallace,Gilan E. Saadawi,John Philip

DOI: https://doi.org/10.1158/1538-7445.am2024-4966

IF: 11.2

2024-04-04

Cancer Research

Abstract:Introductory Statement: The goal is to use machine learning (ML) and large language model (LLM) to augment the manual curation of cancer data elements. Introduction: Memorial Sloan Kettering Cancer Center (MSKCC) has ~100,000 cancer patients and counting with genomic testing. Clinicians use genomic data for research but lack clinical data to analyze together. We use a vendor, VASTA Global to hire curators to manually curate cancer patient's core clinical data elements (CCDE) within unstructured/paragraph text in electronic medical record (EMR) notes. CCDE encompasses 122 data elements that include a patient's full cancer history that can take up to 1 working day to curate. We collaborated with the Realyze Intelligence Healthcare Solutions vendor to use their AI pipeline to generate the manual curated dataset. Realyze generated the CCDE data elements such as histology, pathology site, MMR, TNM staging, ECOG, and KPS for a pilot lung cancer cohort of 150 patients. We manually validated the generated data for 74 out of 150 patients. Methods:The Realyze platform uses a combination of LLMs, ML algorithms and standard terminologies to create a cancer patient model. These models are flexible enough to address the unique needs and challenges of a pan-cancer oncology model. By using standardized FHIR export, results were delivered to a data lake solution and written into a REDCap database to enable human review. Summary:We manually assessed 74 patients. The NLP gave concordant values for MMR, KPS and TNM staging for 100% of the instances. For MMR these were all null values with false negative (FN) of 100% accuracy. Pathology site had 92.15% accuracy while histology has 97.5% accuracy. Conclusion:Will work on refining pathology site and histology's ICDO3 list to increase the percentage of accuracy. Once Realyze refines their model for these data elements we will re-run it on a larger cohort of cancer patients and calculate the accuracy. 
Accuracy Results (74 patients assessed)

Clinical data element: Accuracy %
ECOG: 98.6
KPS: 100
T (path): 100
T (clinical): 100
N (path): 100
N (clinical): 100
M (path): 100
M (clinical): 100
MMR: 100
Histology (path): 97.5
Path site: 92.15

Citation Format: Andrew Niederhausern, Nadia S. Bahadur, Gary Wallace, Gilan E. Saadawi, John Philip. Machine learning and large language model approach to pancancer data elements [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4966.
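
The per-element accuracy figures reported in this abstract can be read as the share of validated patients for whom the pipeline's value agrees with the manual reference. The following is a minimal sketch of that arithmetic, not the authors' code: the function name `element_accuracy`, the record layout, and the sample values are all hypothetical, and treating two null values as concordant is an assumption based on the abstract's remark that the MMR values were all null.

```python
from typing import Dict, List, Optional

# One record per patient: data element name -> value (None when absent).
Record = Dict[str, Optional[str]]


def element_accuracy(extracted: List[Record], reference: List[Record], element: str) -> float:
    """Return the percentage of patients whose extracted value equals the reference.

    Assumption: a missing (None) value on both sides counts as concordant.
    """
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must cover the same patients")
    if not reference:
        raise ValueError("at least one patient is required")
    matches = sum(
        1 for ext, ref in zip(extracted, reference) if ext.get(element) == ref.get(element)
    )
    return 100.0 * matches / len(reference)


# Hypothetical toy data: four patients, one KPS disagreement, MMR null throughout.
extracted = [
    {"KPS": "90", "MMR": None},
    {"KPS": "80", "MMR": None},
    {"KPS": "70", "MMR": None},
    {"KPS": "60", "MMR": None},
]
reference = [
    {"KPS": "90", "MMR": None},
    {"KPS": "80", "MMR": None},
    {"KPS": "70", "MMR": None},
    {"KPS": "50", "MMR": None},
]

print(element_accuracy(extracted, reference, "KPS"))  # 75.0
print(element_accuracy(extracted, reference, "MMR"))  # 100.0
```

Under this convention, an element that is null for every patient scores 100% even though no positive value was ever extracted, which is worth keeping in mind when reading the MMR row.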

oncology
Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

Haitham A Elmarakeby,Pavel S Trukhanov,Vidal M Arroyo,Irbaz Bin Riaz,Deborah Schrag,Eliezer M Van Allen,Kenneth L Kehl

DOI: https://doi.org/10.1186/s12859-023-05439-1

IF: 3.307

2023-09-02

BMC Bioinformatics

Abstract:Background: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. Results: We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. Conclusion: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. 
When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
Enhancing Clinical Data Extraction from Pathology Reports: A Comparative Analysis of Large Language Models

Sunghyeon Park,Wona Choi,InYoung Choi

DOI: https://doi.org/10.3233/SHTI240523

2024-08-22

Abstract:This study evaluates the efficacy of a small large language model (sLLM) in extracting critical information from free-text pathology reports across multiple centers, addressing the challenges posed by the narrative and complex nature of these documents. Employing three variants of the Llama 2 model, with 7 billion, 13 billion, and 70 billion parameters, the research assesses model performance in both zero-shot and five-shot settings, offering insights into the impact of example-based learning. A specialized information extraction tool utilizing regular expressions for pattern identification serves as the benchmark for evaluating the models' accuracy. Conducted within a hospital's internal environment, the study emphasizes the clinical applicability of these findings. The results reveal significant variations in model performance, with the 70 billion parameter model achieving remarkable accuracy in the five-shot scenario, demonstrating the potential of sLLMs in enhancing the efficiency and accuracy of data extraction from pathology reports. The study highlights the importance of example-driven learning and the trade-offs between model size, accuracy, hallucination rates, and processing time. These findings contribute to the ongoing efforts to integrate advanced language models into clinical settings, potentially transforming patient care and biomedical research by mitigating the limitations of manual data extraction processes.
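
To make the zero-shot versus five-shot distinction in this abstract concrete, the sketch below assembles an extraction prompt with and without worked examples. It is an illustration under stated assumptions rather than the study's method: the function `build_prompt`, the instruction wording, the field names, and the sample report text are hypothetical, and no model is called.

```python
from typing import List, Sequence, Tuple


def build_prompt(report: str, fields: List[str], examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build an extraction prompt; zero-shot when `examples` is empty, k-shot otherwise.

    Each example is a (report text, expected answer) pair shown before the target report.
    """
    parts = [
        "Extract the following fields from the pathology text and reply as JSON: "
        + ", ".join(fields)
        + "."
    ]
    for example_report, example_answer in examples:
        parts.append(f"Report:\n{example_report}\nAnswer:\n{example_answer}")
    # The target report comes last, with the answer left open for the model.
    parts.append(f"Report:\n{report}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical field names and texts, for illustration only.
fields = ["histology", "tumor_size_cm"]
shots = [("Example report text.", '{"histology": "example", "tumor_size_cm": 1.0}')] * 5

zero_shot = build_prompt("Target report text.", fields)
five_shot = build_prompt("Target report text.", fields, shots)

print(zero_shot.count("Report:"))  # 1
print(five_shot.count("Report:"))  # 6
```

The only difference between the two settings is the block of solved examples placed ahead of the target report, so the five-shot prompt is longer and costs more tokens per report.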
Synoptic Reporting by Summarizing Cancer Pathology Reports using Large Language Models

Sivaraman Rajaganapathy,Shaika Chowdhury,Vincent Buchner,Zhe He,Xiaoqian Jiang,Ping Yang,James R Cerhan,Nansu Zong

DOI: https://doi.org/10.1101/2024.04.26.24306452

2024-05-09

Abstract:Background: Synoptic reporting, the documenting of clinical information in a structured manner, is known to improve patient care by reducing errors and increasing readability, interoperability, and report completeness. Despite its advantages, manually synthesizing synoptic reports from narrative reports is expensive and error prone when there are many structured fields. While the recent revolutionary developments in Large Language Models (LLMs) have significantly advanced natural language processing, their potential for innovations in medicine is yet to be fully evaluated. Objectives: In this study, we explore the strengths and challenges of utilizing the state-of-the-art language models in the automatic synthesis of synoptic reports. Materials and Methods: We use a corpus of 7,774 cancer-related narrative pathology reports, which have annotated reference synoptic reports from Mayo Clinic EHR. Using these annotations as a reference, we reconfigure the state-of-the-art large language models, such as LLAMA-2, to generate the synoptic reports. Our annotated reference synoptic reports contain 22 unique data elements. To evaluate the accuracy of the reports generated by the LLMs, we use several metrics including the BERT F1 Score and verify our results by manual validation. Results: We show that using fine-tuned LLAMA-2 models, we can obtain a BERT Score F1 of 0.86 or higher across all data elements and BERT F1 scores of 0.94 or higher on 50% (11 of 22) of the questions. The BERT F1 scores translate to average accuracies of 76% and as high as 81% for short clinical reports. Conclusions: We demonstrate successful automatic synoptic report generation by fine-tuning large language models.
