EchoNet-Synthetic: Privacy-preserving Video Generation for Safe Medical Data Sharing

Hadrien Reynaud, Qingjie Meng, Mischa Dombrowski, Arijit Ghosh, Thomas Day, Alberto Gomez, Paul Leeson, Bernhard Kainz
2024-06-03
Abstract: To make medical datasets accessible without sharing sensitive patient information, we introduce a novel end-to-end approach for generative de-identification of dynamic medical imaging data. Until now, generative methods have faced constraints in terms of fidelity, spatio-temporal coherence, and the length of generation, failing to capture the complete details of dataset distributions. We present a model designed to produce high-fidelity, long and complete data samples with near-real-time efficiency and explore our approach on a challenging task: generating echocardiogram videos. We develop our generation method based on diffusion models and introduce a protocol for medical video dataset anonymization. As an exemplar, we present EchoNet-Synthetic, a fully synthetic, privacy-compliant echocardiogram dataset with paired ejection fraction labels. As part of our de-identification protocol, we evaluate the quality of the generated dataset and propose to use clinical downstream tasks as a measurement on top of widely used but potentially biased image quality metrics. Experimental outcomes demonstrate that EchoNet-Synthetic achieves comparable dataset fidelity to the actual dataset, effectively supporting the ejection fraction regression task. Code, weights and dataset are available at https://github.com/HReynaud/EchoNet-Synthetic.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the tension between protecting patient privacy and sharing medical datasets. Specifically, the authors propose a new end-to-end method for generating high-quality synthetic medical imaging data, in particular echocardiogram videos, that can support medical research and clinical applications without revealing patients' sensitive information. The key challenges mentioned in the paper include:

1. **Data Privacy**: Medical data contains a large amount of sensitive information, and sharing it directly creates privacy risks. A method is therefore needed to generate synthetic data for research and training without revealing personally identifying information.
2. **Data Quality and Diversity**: The synthetic data should preserve the quality and diversity of the original data and should also support downstream tasks, such as regression of the left ventricular ejection fraction (LVEF). This means the generated videos must be realistic both spatially and temporally.
3. **Generation Efficiency**: For the generated data to be practical in large-scale research projects, the generation process needs to be efficient and able to produce long, coherent video sequences within a reasonable time.

To address these challenges, the authors introduce a Latent Video Diffusion Model (LVDM) and design a complete protocol for data generation and privacy filtering. The protocol checks the quality of the generated data and also verifies its usefulness by evaluating downstream tasks. In addition, the authors release EchoNet-Synthetic, a fully synthetic, privacy-compliant echocardiogram dataset intended to promote research in related fields.
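The summary above mentions a privacy-filtering step but does not describe how it works. As a minimal sketch of one way such a filter could operate, the example below assumes each video has already been mapped to an embedding vector and rejects any synthetic sample whose cosine similarity to a real sample exceeds a threshold. The function name `privacy_filter`, the threshold value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def privacy_filter(synthetic, real, threshold=0.9):
    """Return a boolean mask that is True for synthetic samples to keep.

    Hypothetical helper: `synthetic` and `real` are 2-D arrays of embedding
    vectors (one row per video). A synthetic sample is rejected when its
    highest cosine similarity to any real sample reaches `threshold`.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    s = synthetic / np.linalg.norm(synthetic, axis=1, keepdims=True)
    r = real / np.linalg.norm(real, axis=1, keepdims=True)
    similarity = s @ r.T                  # shape: (n_synthetic, n_real)
    closest = similarity.max(axis=1)      # nearest real sample per synthetic one
    return closest < threshold


# Toy embeddings (assumed values, for illustration only).
real = np.array([[1.0, 0.0], [0.0, 1.0]])
synthetic = np.array([[1.0, 0.01],   # nearly identical to the first real sample
                      [1.0, 1.0]])   # equally far from both real samples
keep = privacy_filter(synthetic, real, threshold=0.9)
```

Under these assumptions the first synthetic sample would be discarded as too close to a real one, while the second would be kept. In practice the choice of embedding and threshold determines how strict the filter is.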