Abstract: Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, are still unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework can effectively capture the spatio-temporal dynamics of echocardiography and learn representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed around four downstream tasks that follow the echocardiography examination routine. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.

What problem does this paper attempt to address?

The paper addresses the limitations of existing deep-learning models for cardiac ultrasound (echocardiography) analysis, particularly for video data. Existing methods rely on large amounts of labeled data for supervised training, and labeling cardiac ultrasound images is time-consuming and resource-intensive. In addition, general-purpose foundation models (such as those trained on natural images) face a significant domain gap when applied to medical images, which degrades their performance. The paper therefore proposes EchoFM, a foundation model designed specifically for cardiac ultrasound. It uses a self-supervised learning framework to capture the spatial and temporal dynamics of cardiac ultrasound videos, reducing the dependence on labeled data and improving generalization across multiple downstream tasks.

### Background and Problems of the Paper

- **Importance of Cardiac Ultrasound Images**: Cardiac ultrasound is a widely used non-invasive imaging method for evaluating cardiac structure and function. It provides immediate feedback and supports detailed assessment from different probe orientations.
- **Challenges in the Application of Deep Learning**: Although deep learning has significantly improved automatic recognition and segmentation in cardiac ultrasound images, its effectiveness depends heavily on the quantity, quality, and representativeness of labeled data. Because labeling is complex and time-consuming, this dependence limits the application of deep-learning models to cardiac ultrasound analysis.
- **The Rise of Foundation Models**: Foundation models are pre-trained on large-scale unlabeled data through self-supervised learning and learn diverse features that transfer to multiple downstream tasks. However, existing foundation models face a domain gap when applied to medical images, and to cardiac ultrasound images in particular.

### Methods Proposed in the Paper

- **EchoFM Model**: EchoFM is a foundation model designed specifically for cardiac ultrasound videos. It captures spatial and temporal dynamics through a self-supervised learning framework.
- **Self-supervised Learning Framework**: EchoFM introduces a spatio-temporally consistent masking strategy and periodic-driven contrastive learning, which together learn representative video features without labels.
- **Large-scale Data Set**: The model is pre-trained on more than 290,000 cardiac ultrasound videos covering 26 scan views and different imaging modes, with up to 20 million frames.
- **Adaptability to Downstream Tasks**: The pre-trained EchoFM can be adapted and fine-tuned for multiple downstream tasks, such as view recognition, ventricular segmentation, aortic stenosis detection, and aortic regurgitation severity estimation.

### Main Contributions

1. **Proposing EchoFM**: The paper presents EchoFM as the first foundation model specifically for cardiac ultrasound videos, pre-trained on a large-scale data set covering multiple scan views and imaging modes.
2. **Innovative Spatio-Temporally Consistent Masking Strategy**: The paper proposes a spatio-temporally consistent masking strategy designed to work with the periodic contrastive loss, addressing the inherent spatio-temporal variability and occlusion problems in cardiac ultrasound sequences.
3. **Extensive Experimental Verification**: EchoFM is evaluated on public and multi-center data sets. The reported results show that it outperforms the compared state-of-the-art methods on all downstream tasks, indicating strong generalization ability and flexibility.

### Conclusion

EchoFM uses a self-supervised learning framework to reduce the reliance on labeled data in cardiac ultrasound analysis, offering a new approach to the efficient analysis of cardiac ultrasound videos. Its reported performance on multiple downstream tasks is better than that of existing methods, which suggests potential for clinical applications.
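
The two ingredients named under the proposed methods can be illustrated with a minimal sketch. This is not the authors' implementation, and the summary does not specify either component in detail. The sketch assumes that "spatio-temporally consistent" masking means one spatial patch pattern reused in every frame (tube-style masking), and it uses a generic InfoNCE loss as a stand-in for the contrastive objective. The names `tube_mask`, `cosine`, and `info_nce`, and all parameter values, are hypothetical.

```python
import math
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = masked).

    Assumption: the consistency is modeled as tube masking, i.e. a single
    spatial pattern is sampled once and reused for every frame.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [i in masked for i in range(num_patches)]
    # Copy the same spatial pattern to every time step.
    return [list(frame_mask) for _ in range(num_frames)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor embedding.

    How positives are chosen is left to the caller. One hypothetical choice
    for periodic signals is a clip taken one cardiac cycle away from the
    anchor; the source does not confirm this pairing.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


if __name__ == "__main__":
    mask = tube_mask(num_frames=4, grid_h=2, grid_w=4, mask_ratio=0.75, seed=0)
    print("masked patches per frame:", sum(mask[0]))
    print("same pattern in all frames:", all(row == mask[0] for row in mask))
    print("loss:", info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```

Reusing one spatial pattern across frames prevents a masked region from being recovered by copying the same location from a neighboring frame, which is the usual motivation for tube-style masking in video pre-training. Whether EchoFM uses exactly this scheme is not stated in the summary.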

EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
