Abstract:

Background: Medical research with real-world clinical data can be challenging due to privacy requirements. Ideally, patient data are handled in a fully pseudonymised or anonymised way. However, this can make it difficult for medical researchers to access and analyze large datasets or to exchange data between hospitals. De-identifying medical free text is particularly difficult because documentation styles are diverse and the data are unstructured. Recent advances in natural language processing (NLP), driven by the development of large language models (LLMs), have greatly improved the ability to extract information from unstructured text.

Methods: We hypothesize that LLMs are highly effective tools for extracting patient-related information, which can subsequently be used to de-identify medical reports. To test this hypothesis, we conduct a benchmark study using eight locally deployable LLMs (Llama-3 8B, Llama-3 70B, Llama-2 7B, Llama-2 70B, Llama-2 7B "Sauerkraut", Llama-2 70B "Sauerkraut", Mistral 7B, and Phi-3-mini) to extract patient-related information from a dataset of 100 real-world clinical letters. We then remove the identified information using our newly developed LLM-Anonymizer pipeline.

Results: The LLM-Anonymizer, when used with Llama-3 70B, achieved a success rate of 98.05% in removing text characters carrying personal identifying information. Measured against the characters manually identified as containing personal information and identifiable characteristics, the system missed only 1.95% of personal identifying information and erroneously redacted only 0.85% of the characters.

Conclusion: We provide our full LLM-based Anonymizer pipeline under an open-source license with a user-friendly web interface that operates on local hardware and requires no programming skills. This tool has the potential to significantly facilitate medical research by enabling secure and efficient on-premises de-identification of clinical free-text data, thereby addressing key challenges in medical data sharing.
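
To make the extract-then-redact idea and the character-level scores more concrete, the following is a minimal Python sketch; it is not the LLM-Anonymizer implementation. It assumes the LLM step has already returned a list of identifier strings (the hard-coded lists below are hypothetical stand-ins for model output and for a manual annotation), and it matches them by exact string search. The function names `flag_identifiers`, `redact`, and `char_scores` are illustrative, and the denominator chosen for the over-redaction rate (all non-identifying characters) is an assumption, since the abstract does not spell it out.

```python
import re


def flag_identifiers(text, identifiers):
    """Return one boolean per character: True where an identifier string occurs."""
    flags = [False] * len(text)
    for ident in identifiers:
        if not ident:  # skip empty strings, which would match at every position
            continue
        for match in re.finditer(re.escape(ident), text):
            for i in range(match.start(), match.end()):
                flags[i] = True
    return flags


def redact(text, flags, mask="#"):
    """Replace every flagged character with a mask character."""
    return "".join(mask if flagged else ch for ch, flagged in zip(text, flags))


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def char_scores(predicted, gold):
    """Character-level counts and rates of a prediction against a manual annotation."""
    tp = sum(p and g for p, g in zip(predicted, gold))          # correctly redacted
    fn = sum(g and not p for p, g in zip(predicted, gold))      # identifying, but missed
    fp = sum(p and not g for p, g in zip(predicted, gold))      # redacted without need
    tn = sum(not p and not g for p, g in zip(predicted, gold))  # correctly kept
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "removed_share": _ratio(tp, tp + fn),   # share of identifying characters removed
        "missed_share": _ratio(fn, tp + fn),    # share of identifying characters missed
        # Assumption: over-redaction is measured against non-identifying characters.
        "over_redacted_share": _ratio(fp, fp + tn),
    }


# Hypothetical example: the "model" finds the name, misses the birth date,
# and wrongly flags a harmless word.
letter = "Patient Max Mustermann, born 01.02.1960, was admitted."
gold_flags = flag_identifiers(letter, ["Max Mustermann", "01.02.1960"])
model_flags = flag_identifiers(letter, ["Max Mustermann", "admitted"])

print(redact(letter, model_flags))
print(char_scores(model_flags, gold_flags))
```

Under these definitions the removed and missed shares always sum to one, which matches how the reported 98.05% and 1.95% relate to each other. Exact string matching is the simplest possible choice here; a real pipeline would likely also need to handle spelling variants and partial matches.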
