LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models

Isabella C. Wiest, Fabian Wolf, Marie-Elisabeth Lessmann, Marko van Treeck, Dyke Ferber, Jiefu Zhu, Heiko Boehme, Keno K. Bressem, Hannes Ulrich, Matthias P. Ebert, Jakob Nikolas Kather
DOI: https://doi.org/10.1101/2024.09.02.24312917
2024-09-03
Abstract: In clinical science and practice, text data such as clinical letters or procedure reports are stored in an unstructured way. Such data cannot be used directly for quantitative investigations, and manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM-based information extraction (LLM-AIx) that enables extraction of predefined entities from unstructured text using privacy-preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis. The protocol consists of four main processing steps: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE, and 4) output evaluation. LLM-AIx can be integrated on local hospital hardware without transferring any patient data to external servers. As example tasks, we applied LLM-AIx to anonymize fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and the laterality of the pulmonary embolism from these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE task on a real-world dataset, 100 pathology reports from The Cancer Genome Atlas (TCGA) program, for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and in no more than a few minutes or hours, depending on the LLM selected.
What problem does this paper attempt to address?
The paper addresses the problem of efficiently extracting structured information from unstructured medical texts (such as clinical letters or procedure reports) in clinical science and practice. Because these data are unstructured, they cannot be used directly for quantitative research, and manual review or structured information retrieval is time-consuming and costly. The paper proposes an information extraction (IE) workflow based on large language models (LLMs), called LLM-AIx, which uses privacy-preserving LLMs to extract predefined entities from unstructured text and thereby converts clinical text into structured data. This removes a key barrier to efficient information extraction in clinical research and practice, which in turn supports clinical decision-making, patient outcomes, and large-scale data analysis. The workflow can also run on local hospital hardware without transferring any patient data to external servers, so patient data stay within the institution.
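The four processing steps named in the abstract can be illustrated with a minimal sketch. This is not the LLM-AIx implementation: the field names in `SCHEMA`, every function name, and the keyword-based `stub_local_llm` (a stand-in for a locally hosted model, so that the example runs without any model or network access) are assumptions made purely for illustration.

```python
import json
import re
from typing import Callable, Dict, List

# Step 1 (problem definition): hypothetical entities to extract.
SCHEMA = ["dyspnea", "chest_pain", "laterality"]


def preprocess(text: str) -> str:
    """Step 2: collapse whitespace left over from document conversion."""
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(text: str) -> str:
    """Step 3a: ask for the predefined entities as a JSON object."""
    return (
        f"Extract the fields {SCHEMA} as a JSON object "
        f"from the clinical letter below.\n\n{text}"
    )


def stub_local_llm(prompt: str) -> str:
    """Stand-in for a local model. Uses naive keyword rules, NOT an LLM."""
    letter = prompt.split("\n\n", 1)[1].lower()
    if "bilateral" in letter:
        laterality = "bilateral"
    elif "left" in letter:
        laterality = "left"
    elif "right" in letter:
        laterality = "right"
    else:
        laterality = "unknown"
    return json.dumps({
        "dyspnea": "dyspnea" in letter,
        "chest_pain": "chest pain" in letter,
        "laterality": laterality,
    })


def run_pipeline(letters: List[str], llm: Callable[[str], str]) -> List[Dict]:
    """Steps 2 and 3: preprocess each letter, query the model, parse JSON."""
    results = []
    for letter in letters:
        raw = llm(build_prompt(preprocess(letter)))
        parsed = json.loads(raw)
        # Keep only the predefined fields; missing ones become None.
        results.append({key: parsed.get(key) for key in SCHEMA})
    return results


def evaluate(predictions: List[Dict], truth: List[Dict]) -> Dict[str, float]:
    """Step 4: per-field accuracy against manually annotated ground truth."""
    return {
        key: sum(p[key] == t[key] for p, t in zip(predictions, truth)) / len(truth)
        for key in SCHEMA
    }
```

In a real deployment, `stub_local_llm` would be replaced by a call to a model served on hospital hardware, so that the letters never leave the local network; the preprocessing, parsing, and evaluation steps would keep the same shape.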