Abstract:

Background: The narrative free text in electronic medical records (EMRs) contains valuable clinical information for analysis and research that can inform better patient care. However, the release of free text for secondary use is hindered by concerns about personally identifiable information (PII), because protecting individuals' privacy is paramount. Free text must therefore be deidentified to remove PII. Manual deidentification is time-consuming and labor-intensive, and numerous automated deidentification approaches and systems have been proposed over the past decade to overcome this challenge.

Objective: We sought to develop an accurate, web-based system for deidentifying free text (DEFT) that can be readily adopted in real-world settings for the deidentification of free text in EMRs. The system's key features include a simple, task-focused web user interface; customized PII types; a state-of-the-art deep learning model for tagging PII in free text; preannotation through an interactive learning loop; rapid manual annotation with autosave; support for project management and team collaboration; user access control; and central data storage.

Methods: DEFT comprises frontend and backend modules and communicates with central data storage through filesystem path access. The frontend web user interface gives end users a workspace for managing and annotating free text. The backend module processes requests from the frontend and performs the relevant persistence operations. DEFT manages the deidentification workflow as a project, which can contain one or more data sets. Customized PII types and user access control can also be configured. The deep learning model is a Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) with RoBERTa as the word embedding layer. The interactive learning loop is integrated into DEFT to speed up the deidentification process and improve performance over time.

Results: DEFT has advantages over existing deidentification systems in its support for project management, user access control, data management, and an interactive learning process. On the 2014 i2b2 data set, DEFT achieved the highest performance when compared with 5 benchmark models, with microaverage strict entity-level recall and F1-score of 0.9563 and 0.9627, respectively. In a real-world use case of deidentifying clinical notes extracted from 1 referral hospital in Sydney, New South Wales, Australia, DEFT achieved a microaverage strict entity-level F1-score of 0.9507 on a corpus of 600 annotated clinical notes. Moreover, manual annotation with preannotation showed a 43% increase in work efficiency compared with annotation without preannotation.

Conclusions: DEFT is designed for health domain researchers and data custodians to easily deidentify free text in EMRs. It supports an interactive learning loop, and end users with minimal technical knowledge can perform the deidentification work with only a shallow learning curve.
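
The reported scores are microaverage strict entity-level metrics. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the example below assumes BIO-style token tags, counts a predicted entity as correct only when its boundaries and PII type both match the gold annotation exactly, and pools the counts over all documents before computing precision, recall, and F1. The function names and the tag names `NAME`, `DATE`, and `LOCATION` are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

# (document index, start token, end token exclusive, PII type)
Entity = Tuple[int, int, int, str]


def bio_to_entities(tags: List[str], doc_id: int = 0) -> Set[Entity]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    entities: Set[Entity] = set()
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            entities.add((doc_id, start, i, label))
        if tag.startswith(("B-", "I-")):
            # A stray "I-" tag is leniently treated as the start of an entity.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return entities


def micro_strict_prf(
    gold_docs: List[List[str]], pred_docs: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with strict span matching."""
    tp = fp = fn = 0
    for doc_id, (gold, pred) in enumerate(zip(gold_docs, pred_docs)):
        gold_set = bio_to_entities(gold, doc_id)
        pred_set = bio_to_entities(pred, doc_id)
        tp += len(gold_set & pred_set)  # exact boundary and type match
        fp += len(pred_set - gold_set)  # predicted but not in gold
        fn += len(gold_set - pred_set)  # in gold but missed or mismatched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data only: the first predicted NAME span is one token too short,
# so under strict matching it counts as both a false positive and a false negative.
gold = [["B-NAME", "I-NAME", "O", "B-DATE"], ["O", "B-LOCATION", "I-LOCATION"]]
pred = [["B-NAME", "O", "O", "B-DATE"], ["O", "B-LOCATION", "I-LOCATION"]]
precision, recall, f1 = micro_strict_prf(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Pooling counts across documents before dividing is what makes the average "micro": frequent PII types weigh more heavily than rare ones, in contrast to a macroaverage that would average per-type scores.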

A Deep Learning-Based System for the MEDDOCAN Task

Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining

Application of Chinese medical document anonymization in EMR system

A Study of Deep Learning Methods for De-Identification of Clinical Notes in Cross-Institute Settings

A Cascaded Approach for Chinese Clinical Text De-Identification with Less Annotation Effort

An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

Automatic De-Identification of Electronic Medical Records Using Token-Level and Character-Level Conditional Random Fields

De-identification of Clinical Notes Via Recurrent Neural Network and Conditional Random Field

DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data

A Hybrid Machine Learning Method for the De-identification of Un-Structured Narrative Clinical Text in Multi-center Chinese Electronic Medical Records Data

A Machine Learning Based Approach to Identify Protected Health Information in Chinese Clinical Text

De-identifying Australian Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models

Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study

Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study

De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks

De-identification of Medical Records Using Conditional Random Fields and Long Short-Term Memory Networks

A Study of Deep Learning Approaches for Medication and Adverse Drug Event Extraction from Clinical Text

MADEx: A System for Detecting Medications, Adverse Drug Events, and Their Relations from Clinical Notes

Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports

DI++: A deep learning system for patient condition identification in clinical notes