EDMAE: An Efficient Decoupled Masked Autoencoder for Standard View Identification in Pediatric Echocardiography

Yiman Liu, Xiaoxiang Han, Tongtong Liang, Bin Dong, Jiajun Yuan, Menghan Hu, Qiaohong Liu, Jiangang Chen, Qingli Li, Yuqi Zhang
DOI: https://doi.org/10.1016/j.bspc.2023.105280
2023-08-03
Abstract: This paper introduces the Efficient Decoupled Masked Autoencoder (EDMAE), a novel self-supervised method for recognizing standard views in pediatric echocardiography. EDMAE introduces a new proxy task based on the encoder-decoder structure. The EDMAE encoder is composed of a teacher encoder and a student encoder. The teacher encoder extracts the latent representation of the masked image blocks, while the student encoder extracts the latent representation of the visible image blocks. The loss is calculated between the feature maps output by the two encoders to ensure consistency in the latent representations they extract. EDMAE uses pure convolution operations instead of the ViT structure in the MAE encoder, which improves training efficiency and convergence speed. EDMAE is pre-trained on a large-scale private dataset of pediatric echocardiography using self-supervised learning, and then fine-tuned for standard view recognition. The proposed method achieves high classification accuracy in 27 standard views of pediatric echocardiography. To further verify the effectiveness of the proposed method, the authors perform another downstream task of cardiac ultrasound segmentation on the public dataset CAMUS. The experimental results demonstrate that the proposed method outperforms some popular supervised and recent self-supervised methods, and is more competitive on different downstream tasks.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the problem of standard view recognition in pediatric echocardiography. Specifically, the authors propose a novel self-supervised method named **Efficient Decoupled Masked Autoencoder (EDMAE)** for recognizing standard views in pediatric echocardiography.

### Background and Motivation

Congenital heart diseases (CHDs) are among the most common birth defects, with approximately 100,000 to 150,000 newborns diagnosed with CHD each year in China. Early and accurate diagnosis is of great clinical significance. Transthoracic echocardiography (TTE) is a cost-effective, non-invasive, and radiation-free imaging technique that can visualize the heart dynamically in real time. TTE has become an important tool for diagnosing and treating CHD because it can quickly detect various cardiac abnormalities. However, due to the complexity of congenital heart diseases and the diversity of spatial configurations, accurate diagnosis through TTE is time-consuming and requires experienced professionals.

### Research Objectives

1. **Automatic Recognition of Standard Views**: Develop a method that can automatically recognize standard views in pediatric echocardiography to improve diagnostic efficiency and accuracy.
2. **Reduce the Need for Labeled Data**: Utilize self-supervised learning methods to reduce the reliance on large amounts of labeled data, thereby lowering the cost and time of data annotation.
3. **Improve Model Performance**: Design an efficient decoupled masked autoencoder (EDMAE) to enhance the model's performance in various downstream tasks, such as standard view recognition and cardiac ultrasound segmentation.

### Method Overview

1. **EDMAE Model**:
   - **Decoupled Structure**: EDMAE employs two identical encoders, one as the teacher encoder and the other as the student encoder.
     The student encoder processes the visible image patches and is the encoder whose weights are trained, while the teacher encoder processes the masked image patches, with its weights updated from the student encoder.
   - **Feature Alignment**: A loss computed between the feature maps output by the two encoders keeps the latent representations of masked and visible image patches consistent.
   - **Convolution Operations**: Pure convolution operations are used instead of the ViT structure, improving training efficiency and convergence speed.
2. **Self-Supervised Pretraining**:
   - Conduct self-supervised pretraining on a large-scale unlabeled pediatric cardiac ultrasound image dataset.
   - Mask 75% of each image and optimize the network parameters so that the model produces high-quality reconstructions.
3. **Downstream Tasks**:
   - **Standard View Recognition**: Replace the decoder with a linear layer and fine-tune using the cross-entropy loss function.
   - **Cardiac Ultrasound Segmentation**: Use the decoder of DenseUNet as the segmentation head to output segmentation results and calculate the loss using Focal Loss.

### Experimental Results

1. **Experiments on Private Dataset**:
   - On a private dataset containing 27 standard views, the EDMAE method outperformed other mainstream classification networks and self-supervised methods in terms of overall accuracy, precision, recall, specificity, and F1 score.
   - Specific metrics are as follows:
     - Overall Accuracy: 98.48%
     - Average Precision: 93.20%
     - Average Recall: 94.62%
     - Average Specificity: 99.73%
     - Average F1 Score: 93.63%
2. **Experiments on Public Dataset CAMUS**:
   - The effectiveness of the EDMAE method was validated on the CAMUS dataset, showing excellent performance in the cardiac ultrasound segmentation task.

### Conclusion

The EDMAE method proposed in this paper achieved significant performance improvements in the task of standard view recognition in pediatric echocardiography and also demonstrated competitive performance in the cardiac ultrasound segmentation task.
This method not only reduces the reliance on large amounts of labeled data but also improves training efficiency and model performance.
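
As a rough illustration of the pre-training setup summarized under Method Overview, the Python sketch below splits patch indices at the 75% masking rate and computes an alignment loss between two sets of feature vectors. The function names, the 14 x 14 patch grid, and the use of mean squared error are assumptions made for illustration: the summary does not say which loss the authors use, and the real encoders are convolutional networks rather than plain lists.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists at the given ratio."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def alignment_loss(predicted, target):
    """Mean squared error between two equally shaped lists of feature vectors.

    MSE is an assumed stand-in for the feature-map loss between the two encoders.
    """
    if len(predicted) != len(target):
        raise ValueError("feature sets must contain the same number of vectors")
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("feature vectors must have the same length")
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical 14 x 14 grid of patches (196 in total).
visible, masked = random_patch_mask(14 * 14, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In this sketch the visible indices would be fed to the student encoder and the masked indices to the teacher encoder, with `alignment_loss` comparing the student-side prediction for the masked patches against the teacher's features.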
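
The averaged classification metrics listed under Experimental Results can be derived from a confusion matrix. The sketch below assumes macro-averaging (an unweighted mean over classes), which the summary does not confirm, and uses a hypothetical three-class confusion matrix rather than any data from the paper.

```python
def macro_metrics(confusion):
    """Compute overall accuracy and macro-averaged per-class metrics.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    sums = {"precision": 0.0, "recall": 0.0, "specificity": 0.0, "f1": 0.0}
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        sums["precision"] += precision
        sums["recall"] += recall
        sums["specificity"] += specificity
        sums["f1"] += f1
    metrics = {name: value / n for name, value in sums.items()}
    metrics["accuracy"] = correct / total if total else 0.0
    return metrics


# Toy confusion matrix for three hypothetical views (not the paper's results).
toy = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
]
for name, value in macro_metrics(toy).items():
    print(f"{name}: {value:.4f}")
```

With many classes, as in the 27-view task, specificity tends to be high because each class has a large pool of true negatives, which is consistent with specificity being the highest of the reported averages.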
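
For the segmentation task, the summary states only that Focal Loss is used. A minimal per-pixel sketch of the standard focal loss formulation, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), is shown below. The function name and the default values of `gamma` and `alpha` are assumptions; the settings used by the authors are not given here.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one pixel.

    probs  : predicted class probabilities for the pixel (summing to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    alpha  : scalar weighting factor (assumed constant across classes here)
    """
    # Clamp to avoid log(0) for extremely confident wrong predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A well-classified pixel contributes far less than a poorly classified one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.3, 0.7], target=0)
print(easy < hard)  # True
```

The `(1 - p_t) ** gamma` factor down-weights pixels the model already classifies confidently, which helps when background pixels greatly outnumber the cardiac structures being segmented.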