Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse

Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey
2023-03-24
Abstract: The objective of this study is to address the critical issue of de-identification of clinical reports in order to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities, and built a hybrid system that merges the results of a deep learning model with manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the de-identification of clinical reports in a clinical data warehouse, so that the data can be used for research while patient privacy is safeguarded. Specifically, it focuses on using natural language processing (NLP) algorithms to automatically remove or replace protected health information (PHI) in electronic health records (EHRs), thereby reducing the risk that patients are identified by people outside the care team. This issue is critical in France and many other countries because research projects involving identifiable information typically require patient consent, and obtaining consent from each patient is impractical in studies involving thousands or even millions of patients. De-identifying clinical reports is therefore essential for opening the data to research. The paper also discusses the challenges of sharing tools and resources in this field and presents the experience of the Greater Paris University Hospitals (AP-HP) in systematically pseudonymizing the text documents of its Clinical Data Warehouse. The authors developed a hybrid system that combines the output of a deep learning model with manual rules, aiming to improve the accuracy and efficiency of de-identification.
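To make the idea of a hybrid system concrete, the sketch below merges entity spans from two sources and replaces them with placeholders. It is a minimal illustration, not the authors' implementation: the `Span` type, the two regular expressions, the label names, the "keep the longer span on overlap" merge policy, and the `<LABEL>` placeholder format are all assumptions made for this example, and the deep learning model is stood in for by a hard-coded list of spans.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "NAME" (hypothetical label set)


# Hypothetical manual rules, for illustration only.
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by one of the manual rules."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; when two spans overlap, keep the longer one."""
    candidates = sorted(model + rules, key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for s in candidates:
        # Accept a span only if it overlaps none of the spans already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept)  # tuples sort by start offset first


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each (sorted, non-overlapping) span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for s in spans:
        pieces.append(text[cursor:s.start])
        pieces.append(f"<{s.label}>")
        cursor = s.end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Mr Dupont seen on 12/03/2021, tel 01 23 45 67 89."
model_output = [Span(3, 9, "NAME")]  # stand-in for a trained model's predictions
print(pseudonymize(text, merge_spans(model_output, rule_spans(text))))
# prints: Mr <NAME> seen on <DATE>, tel <PHONE>.
```

Taking the union of the two sources favors recall, which is usually the priority when a missed identifier is costlier than an over-masked word. A real pseudonymization system would typically substitute plausible surrogate values rather than bare placeholders.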