Abstract:
Background: Echocardiograms provide vital insights into cardiac health, but their complex, multi-dimensional data presents challenges for analysis and interpretation. Current deep learning models for echocardiogram analysis often rely on supervised training, which limits their generalizability and robustness across datasets and clinical environments.
Objective: To develop and evaluate EchoVisionFM (Echocardiogram video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. EchoVisionFM aims to produce robust and transferable spatiotemporal representations, improving downstream performance across diverse echocardiogram datasets and clinical conditions.
Methods: Our framework employs Echo-VideoMAE, an autoencoder-based video transformer that compresses and reconstructs echocardiogram video data by masking non-overlapping video patches and leveraging a ViT encoder-decoder structure. For enhanced representation, we introduce STFF-Net, a Spatio-Temporal Feature Fusion Network, to integrate spatial and temporal features from the manifold representations. We pre-trained EchoVisionFM on the MIMIC-IV-ECHO dataset and fine-tuned it on the EchoNet-Dynamic dataset for downstream tasks, including classification and regression of key cardiac parameters.
Results: EchoVisionFM demonstrated superior performance in classifying left ventricular ejection fraction (LVEF), achieving an accuracy of 89.12%, an F1 score of 0.9323, and an AUC of 0.9364. In regression tasks, EchoVisionFM outperformed state-of-the-art models, with LVEF prediction reaching a mean absolute error (MAE) of 4.18% and an R² of 0.8022. The model also showed significant improvements in estimating end-systolic and end-diastolic volumes, with R² values of 0.8006 and 0.7296, respectively. Incorporating STFF-Net led to further performance gains across tasks.
Conclusion: Our results indicate that large-scale self-supervised pre-training on echocardiogram videos enables the extraction of transferable and clinically relevant features, outperforming traditional CNN-based methods. The EchoVisionFM framework, particularly with STFF-Net, enhances the extraction of spatiotemporal features, improving the predictive accuracy for various cardiac parameters. EchoVisionFM offers a powerful, scalable approach for echocardiogram analysis, with potential applications in clinical diagnostics and research.
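
To make the masking step described in Methods concrete, the following is a minimal NumPy sketch of splitting a video into non-overlapping spatiotemporal patches and hiding a random subset of them before encoding. It is an illustration only, not the Echo-VideoMAE implementation: the function names `patchify_video` and `random_mask`, the clip size (16 frames of 112 x 112 pixels), the patch size (2 x 16 x 16), and the 90% mask ratio are all assumptions chosen for the example, and the masking here is independent per patch, which may differ from the strategy actually used.

```python
import numpy as np

def patchify_video(video, t=2, p=16):
    """Split a (T, H, W) video into non-overlapping t x p x p patches.

    Returns an array of shape (num_patches, t * p * p).
    Patch sizes are illustrative assumptions, not values from the paper.
    """
    T, H, W = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "dims must divide evenly"
    x = video.reshape(T // t, t, H // p, p, W // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (nT, nH, nW, t, p, p)
    return x.reshape(-1, t * p * p)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

rng = np.random.default_rng(0)
video = rng.random((16, 112, 112)).astype(np.float32)  # synthetic stand-in clip
patches = patchify_video(video)                        # all patches
mask = random_mask(len(patches), mask_ratio=0.9, rng=rng)
visible = patches[~mask]   # what an encoder would see
targets = patches[mask]    # what a decoder would be asked to reconstruct
```

In a masked-autoencoder setup of this kind, only `visible` would be passed to the encoder, and the reconstruction loss would typically be computed on `targets` alone.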

Multimodal Foundation Models For Echocardiogram Interpretation

Vision–language foundation model for echocardiogram interpretation

EchoFM: Foundation Model for Generalizable Echocardiogram Analysis

EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

EchoApex: A General-Purpose Vision Foundation Model for Echocardiography

Echo-Vision-FM: A Pre-training and Fine-tuning Framework for Echocardiogram Videos Vision Foundation Model

PanEcho: Complete AI-enabled echocardiography interpretation with multi-task deep learning

GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis

Echocardiogram Foundation Model -- Application 1: Estimating Ejection Fraction

EchoFM: A Pre-training and Fine-tuning Framework for Echocardiogram Videos Vision Foundation Model

ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis

HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

Multimodal Foundation Models Exploit Text to Make Medical Image Predictions

Fast and accurate classification of echocardiograms using deep learning

Using deep learning to predict cardiovascular magnetic resonance findings from echocardiography videos

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

SIGxCL: A Signal-Image-Graph Cross-Modal Contrastive Learning Framework for CVD Diagnosis Based on Internet of Medical Things

Foundation versus Domain-Specific Model for Cardiac Ultrasound Segmentation

Teach Multimodal LLMs to Comprehend Electrocardiographic Images