A Multicenter, Scan-Rescan, Human and Machine Learning CMR Study to Test Generalizability and Precision in Imaging Biomarker Analysis
Anish Bhuva,Wenjia Bai,Clement Lau,Rhodri Davies,Yang Ye,Heeraj Bulluck,Elisa McAlindon,Veronica Culotta,Peter Swoboda,Gabriella Captur,Thomas Treibel,Joao Augusto,Kristopher Knott,Andreas Seraphim,Graham Cole,Steffen Petersen,Nicola Edwards,John Greenwood,Chiara Bucciarelli-Ducci,Alun Hughes,Daniel Rueckert,James Moon,Charlotte Manisty
DOI: https://doi.org/10.1161/CIRCIMAGING.119.009214
Abstract:Background: Automated analysis of cardiac structure and function using machine learning (ML) has great potential, but is currently hindered by poor generalizability. Comparison is traditionally against clinicians as a reference, ignoring inherent human inter- and intraobserver error, and ensuring that ML cannot demonstrate superiority. Measuring precision (scan:rescan reproducibility) addresses this. We compared precision of ML and humans using a multicenter, multi-disease, scan:rescan cardiovascular magnetic resonance data set. Methods: One hundred ten patients (5 disease categories, 5 institutions, 2 scanner manufacturers, and 2 field strengths) underwent scan:rescan cardiovascular magnetic resonance (96% within one week). After identification of the most precise human technique, left ventricular chamber volumes, mass, and ejection fraction were measured by an expert, a trained junior clinician, and a fully automated convolutional neural network trained on 599 independent multicenter disease cases. Scan:rescan coefficient of variation and 1000 bootstrapped 95% CIs were calculated and compared using mixed linear effects models. Results: Clinicians can be confident in detecting a 9% change in left ventricular ejection fraction, with greater than half of coefficient of variation attributable to intraobserver variation. Expert, trained junior, and automated scan:rescan precision were similar (for left ventricular ejection fraction, coefficient of variation 6.1 [5.2%-7.1%], P=0.2581; 8.3 [5.6%-10.3%], P=0.3653; 8.8 [6.1%-11.1%], P=0.8620). Automated analysis was 186× faster than humans (0.07 versus 13 minutes). Conclusions: Automated ML analysis is faster with similar precision to the most precise human techniques, even when challenged with real-world scan:rescan data. Assessment of multicenter, multi-vendor, multi-field strength scan:rescan data (available at www.thevolumesresource.com) permits a generalizable assessment of ML precision and may facilitate direct translation of ML to clinical practice.