FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data

Yukun Zhang, Guanzhong Chen, Zenglin Xu, Jianyong Wang, Dun Zeng, Junfan Li, Jinghua Wang, Yuan Qi, Irwin King
2024-10-28
Abstract: Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment. Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality. However, the sensitive nature of healthcare data often restricts individual clinical institutions from sharing data to train sufficiently generalized and unbiased ML models. Federated Learning (FL) is an emerging approach, which offers a promising solution by enabling collaborative model training across multiple participants without compromising the privacy of the individual data owners. However, to the best of our knowledge, there has been limited prior research applying FL to the cardiovascular disease domain. Moreover, existing FL benchmarks and datasets are typically simulated and may fall short of replicating the complexity of natural heterogeneity found in realistic datasets that challenges current FL algorithms. To address these gaps, this paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD. This benchmark comprises two major tasks: electrocardiogram (ECG) classification and echocardiogram (ECHO) segmentation, based on naturally scattered datasets constructed from the CVD data of seven institutions. Our extensive experiments on these datasets reveal that FL faces new challenges with real-world non-IID and long-tail data. The code and datasets of FedCVD are available at https://github.com/SMILELab-FL/FedCVD.
Signal Processing, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the difficulty of applying federated learning (FL) to cardiovascular disease (CVD) data. Although machine learning methods can assist in the early diagnosis of CVDs, their performance depends on large-scale, high-quality data. Because medical data is highly sensitive, individual clinical institutions are usually unable to share the data needed to train a sufficiently general and unbiased model. FL enables collaborative model training across multiple participants without compromising the privacy of the data owners. However, research applying FL to CVDs is limited, and most existing FL benchmarks and datasets are simulated, so they do not fully reflect the complexity and heterogeneity of real-world data. To fill this gap, the paper introduces FedCVD, the first real-world FL benchmark for cardiovascular disease. FedCVD is built from real CVD data from seven medical institutions and covers two main tasks: electrocardiogram (ECG) classification and echocardiogram (ECHO) segmentation. Through these tasks, the paper reveals the challenges FL faces with real-world data, namely non-independent and identically distributed (non-IID) data, long-tail distributions, and incomplete labels, and provides evaluation metrics and experimental results to support the design of more effective FL algorithms.

### Main Contributions

1. **Introduction of FedCVD**: An open-source multi-center medical dataset and benchmark specifically for cardiovascular disease. To the best of the authors' knowledge, this is currently the largest multi-center CVD benchmark dataset. It contains multi-label classification and segmentation tasks, and the data is partitioned naturally, by originating institution, rather than by simulation.
2. **Emphasis on Three Key Features**: In the CVD federated learning scenario, non-IID data, long-tail distributions, and label incompleteness are three important features, and each poses a significant challenge to existing FL algorithms.
3. **Extensive Experimental Evaluation**: The paper evaluates mainstream federated learning and centralized learning methods, verifies the effectiveness of FL for cardiovascular disease, and highlights the importance of the three challenges above. The paper also provides open-source code so that other researchers can reproduce the experimental results and integrate it into different FL frameworks.

### Specific Problems

- **Non-independent and identically distributed (non-IID) data**: Features and labels differ significantly across institutions, which can make it difficult for the global model to converge.
- **Long-tail distribution**: The label distribution within and across institutions has an obvious long tail: a few labels dominate, while samples for most labels are scarce. This is particularly prominent in the federated setting, because data imbalance and non-IID labels further exacerbate the problem.
- **Label incompleteness**: Institutions differ in their annotation capabilities, so label completeness varies. For example, some institutions may only be able to annotate some key areas, while others can annotate all of them. This incompleteness degrades model performance on the areas that were not annotated.

Through these contributions, the paper provides an important foundation and reference for federated learning research on cardiovascular disease.
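
To make the three challenges more concrete, here is a minimal pure-Python sketch. It is not taken from the FedCVD code base, and the summary does not name the algorithms the paper evaluates. The sketch assumes FedAvg-style sample-size-weighted averaging as a representative FL aggregation step, and a masked binary cross-entropy as one common way to handle missing annotations. The function names `fedavg`, `label_shares`, and `masked_bce`, and all data values, are hypothetical.

```python
import math
from collections import Counter


def fedavg(client_params, client_sizes):
    """FedAvg-style aggregation: average client parameter vectors,
    weighting each client by its number of local samples."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


def label_shares(labels):
    """Fraction of samples per label, ordered from most to least frequent.
    A few large shares followed by many small ones indicates a long tail."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


def masked_bce(probs, targets, mask):
    """Binary cross-entropy averaged only over annotated labels (mask == 1),
    so labels a client never annotated do not contribute to its loss."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t, m in zip(probs, targets, mask)
        if m
    ]
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Two hypothetical clients with unequal data sizes (illustrative values only).
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    # Hypothetical label list for one client; "N" dominates.
    print(label_shares(["N", "N", "N", "AF"]))
    # The third label is unannotated at this client, so it is masked out.
    print(masked_bce([0.9, 0.2, 0.5], [1, 0, 1], [1, 1, 0]))
```

Weighting by sample count means large institutions dominate the global model. That is one reason non-IID and long-tail data are hard in this setting: labels that are rare at a small client can be washed out during aggregation.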