Abstract:

BACKGROUND: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often used training and test data collected from the same institution. Few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.

METHODS: We created a de-identification corpus using a total of 500 clinical notes from the University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained from general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies to customize the deep learning models using UF notes and resources.

RESULTS: Pre-trained word embeddings from a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958, respectively) when applied to another corpus annotated at UF Health. Linguistic features could further improve the performance of de-identification in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.

CONCLUSIONS: It is necessary to customize de-identification models using local clinical text and other resources when they are applied in cross-institute settings. Fine-tuning is a potential solution to re-use pre-trained parameters and reduce the training time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
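The abstract reports both strict and relaxed F1 scores without defining them. The sketch below shows one common way such span-level scores might be computed. It assumes that "strict" means an exact match on boundaries and type, and that "relaxed" means any character overlap with the same type. These definitions, the `Span` layout, and the function names are illustrative assumptions, not the scoring script used in the study.

```python
from typing import List, Tuple

# Hypothetical span layout: (start_offset, end_offset_exclusive, phi_type)
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True if two spans share a type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], relaxed: bool = False) -> float:
    """Span-level F1. Strict: exact match. Relaxed: overlap (assumed definition)."""
    if relaxed:
        # A predicted span is correct if it overlaps any gold span, and vice versa.
        hit_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        hit_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        hit_pred = sum(p in gold_set for p in pred)
        hit_gold = sum(g in pred_set for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up spans, not data from the study).
gold = [(0, 10, "NAME"), (20, 30, "DATE"), (40, 45, "PHONE")]
pred = [(0, 10, "NAME"), (22, 30, "DATE"), (50, 55, "PHONE")]

strict_score = span_f1(gold, pred, relaxed=False)
relaxed_score = span_f1(gold, pred, relaxed=True)
```

In this toy case the partially matched DATE span counts only under the relaxed setting, which is why relaxed scores are never lower than strict ones under these assumed definitions.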

PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding

PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs

Life of PII -- A PII Obfuscation Transformer

ProPILE: Probing Privacy Leakage in Large Language Models

Teach LLMs to Phish: Stealing Private Information from Language Models

Combing for Credentials: Active Pattern Extraction from Smart Reply

A Study of Deep Learning Methods for De-Identification of Clinical Notes in Cross-Institute Settings

The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks

PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles

Analyzing Leakage of Personally Identifiable Information in Language Models

LLM-PBE: Assessing Data Privacy in Large Language Models

Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models

Quantifying Association Capabilities of Large Language Models and Its Implications on Privacy Leakage

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

A Privacy-Preserving Approach to Extraction of Personal Information through Automatic Annotation and Federated Learning

Protecting Your LLMs with Information Bottleneck

Identifying and Mitigating Privacy Risks Stemming from Language Models: A Survey

Enhancing Data Privacy in Large Language Models through Private Association Editing

DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts

Large Language Models Can Be Good Privacy Protection Learners