Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation

Hadrien Reynaud, Athanasios Vlontzos, Benjamin Hou, Arian Beqiri, Paul Leeson, Bernhard Kainz
DOI: https://doi.org/10.48550/arXiv.2107.00977
2021-07-02
Abstract: Cardiac ultrasound imaging is used to diagnose various heart diseases. Common analysis pipelines involve manual processing of the video frames by expert clinicians. This suffers from intra- and inter-observer variability. We propose a novel approach to ultrasound video analysis using a transformer architecture based on a Residual Auto-Encoder Network and a BERT model adapted for token classification. This enables videos of any length to be processed. We apply our model to the task of End-Systolic (ES) and End-Diastolic (ED) frame detection and the automated computation of the left ventricular ejection fraction. We achieve an average frame distance of 3.36 frames for the ES and 7.17 frames for the ED on videos of arbitrary length. Our end-to-end learnable approach can estimate the ejection fraction with a MAE of 5.95 and $R^2$ of 0.52 in 0.15s per video, showing that segmentation is not the only way to predict ejection fraction. Code and models are available at https://github.com/HReynaud/UVT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the automatic estimation of the left ventricular ejection fraction (LVEF) from cardiac ultrasound videos, using a new method based on the Transformer architecture. The method handles videos of arbitrary length and, without human intervention, identifies the end-systolic (ES) and end-diastolic (ED) frames and computes the LVEF. It aims to overcome the intra- and inter-observer variability of manual frame processing and to improve the accuracy and efficiency of LVEF estimation.

The key issues mentioned in the paper include:

- **Limitations of manual processing of video frames**: Traditional LVEF measurement depends on experts manually selecting the ES and ED frames, which is time-consuming and subject to observer judgment, resulting in inconsistent results.
- **Deficiencies of existing automated methods**: Existing automated methods for processing ultrasound videos mainly operate on individual frames and lack effective support for time-series data, so they cannot fully use the temporal information in the video.
- **The need to handle videos of arbitrary length**: The length of the cardiac cycle varies, and ultrasound videos can be very long, so a method that handles videos of arbitrary length is required to meet different clinical needs.

To solve these problems, the authors propose a Transformer-based model that simultaneously regresses the LVEF value and identifies the positions of the ES and ED frames, enabling a rapid and accurate assessment of cardiac function.
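The quantities discussed above can be made concrete with a small sketch. It is illustrative only and is not taken from the paper's code: the function names are hypothetical, the ejection fraction formula is the standard clinical definition based on end-diastolic and end-systolic volumes (the paper's model regresses the value directly from the video instead), and the frame distance is assumed here to be the mean absolute difference between predicted and labeled frame indices.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Standard clinical definition of LVEF, in percent.

    edv_ml: left ventricular volume at the end-diastolic (ED) frame.
    esv_ml: left ventricular volume at the end-systolic (ES) frame.
    """
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def average_frame_distance(predicted, labeled):
    """Mean absolute distance, in frames, between predicted and labeled indices.

    Assumption: this mirrors the 'average frame distance' quoted in the
    abstract; the paper's exact definition may differ.
    """
    if len(predicted) != len(labeled) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, labeled)) / len(predicted)


def mean_absolute_error(predicted, target):
    """MAE between predicted and reference ejection fractions."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


if __name__ == "__main__":
    # Made-up numbers for illustration, not results from the paper.
    print(ejection_fraction(edv_ml=100.0, esv_ml=40.0))
    print(average_frame_distance([10, 42], [12, 45]))
    print(mean_absolute_error([55.0, 60.0], [50.0, 62.0]))
```

Under these assumptions, an ES frame predicted at index 10 against a label at index 12 contributes a distance of 2 frames, and the reported MAE is in the same units as the ejection fraction itself (percentage points).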