AI integration improves breast cancer screening in a real-world, retrospective cohort study
Frazer,H. M.,Pena-Solorzano,C. A.,Kwok,C. F.,Elliott,M.,Chen,Y.,Wang,C.,the BRAIx team,,Lippey,J.,Hopper,J.,Brotchie,P.,Carneiro,G.,McCarthy,D. J.
DOI: https://doi.org/10.1101/2022.11.23.22282646
2022-11-29
MedRxiv
Abstract:Background: Artificial intelligence (AI) readers, derived from applying deep learning models to medical image analysis, hold great promise for improving population breast cancer screening. However, previous evaluations of AI readers for breast cancer screening have mostly been conducted using cancer-enriched cohorts and have lacked assessment of the potential use of AI readers alongside radiologists in multi-reader screening programs. Here, we present a new AI reader for detecting breast cancer from mammograms in a large-scale population screening setting, and a novel analysis of the potential for human-AI reader collaboration in a well-established, high-performing population screening program. We evaluated the performance of our AI reader and AI-integrated screening scenarios using a two-year, real-world, population dataset from Victoria, Australia, a screening program in which two radiologists independently assess each episode and disagreements are arbitrated by a third radiologist. Methods: We used a retrospective full-field digital mammography image and non-image dataset comprising 808,318 episodes, 577,576 clients and 3,404,326 images in the period 2013 to 2019. Screening episodes from 2016, 2017 and 2018 were sequential population cohorts containing 752,609 episodes, 565,087 clients and 3,169,322 images. The dataset was split by screening date into training, development, and testing sets. All episodes from 2017 and 2018 were allocated to the testing set (509,109 episodes; 3,651 screen-detected cancer episodes). Eight distinct AI models were trained on subsets of the training set (which included a validation set) and combined into our ensemble AI reader. Operating points were selected using the development set. We evaluated our AI reader on our testing set and on external datasets previously unseen by our models. Findings: The AI reader outperformed the mean individual radiologist on this large retrospective testing dataset with an area under the receiver operator characteristic curve of 0.92 (95% CI 0.91, 0.92). The AI reader generalised well across screening round, client demographics, device manufacturer and cancer type, and achieved state-of-the-art performance on external datasets compared to recently published AI readers. Our simulations of AI-integrated screening scenarios demonstrated that a reader-replacement human-AI collaborative system could achieve better sensitivity and specificity (82.6%, 96.1%) compared to the current two-reader consensus system (79.9%, 96.0%), with reduced human reading workload and cost. Our band-pass AI-integrated scenario also enabled both higher sensitivity and specificity (80.6%, 96.2%) with larger reductions in human reading workload and cost. Interpretation: This study demonstrated that human-AI collaboration in a population breast cancer screening program has potential to improve accuracy and lower radiologist workload and costs in real world screening programs. The next stage of validation is to undertake prospective studies that can also assess the effects of the AI systems on human performance and behaviour.