Abstract:
Purpose: Over the last 2 years, the artificial intelligence (AI) community has presented several automatic screening tools for coronavirus disease 2019 (COVID-19) based on chest radiography (CXR), with reported accuracies often well over 90%. However, it has been noted that many of these studies have likely suffered from dataset bias, leading to overly optimistic results. The purpose of this study was to thoroughly investigate to what extent biases have influenced the performance of a range of previously proposed and promising convolutional neural networks (CNNs), and to determine what performance can be expected with current CNNs on a realistic and unbiased dataset.
Methods: Five CNNs for COVID-19 positive/negative classification were implemented for evaluation, namely VGG19, ResNet50, InceptionV3, DenseNet201, and COVID-Net. To perform both internal and cross-dataset evaluations, four datasets were created. The first dataset (Valencian Region Medical Image Bank, BIMCV) followed strict reverse transcriptase-polymerase chain reaction (RT-PCR) test criteria and was created from a single reliable open access databank, while the second dataset (COVIDxB8) was created through a combination of six online CXR repositories. The third and fourth datasets were created by combining the opposing classes from the BIMCV and COVIDxB8 datasets. To decrease inter-dataset variability, a pre-processing workflow of resizing, normalization, and histogram equalization was applied to all datasets. Classification performance was evaluated on unseen test sets using precision and recall. As a qualitative sanity check, saliency maps displaying the top 5%, 10%, and 20% most salient segments in the input CXRs were inspected to assess whether the CNNs were using relevant information for decision making. In an additional experiment, to further investigate the origin of potential dataset bias, all pixel values outside the lungs were set to zero through automatic lung segmentation before training and testing.
Results: When trained and evaluated on the single-source dataset (BIMCV), the performance of all CNNs was relatively low (precision: 0.65-0.72, recall: 0.59-0.71), but remained relatively consistent during external evaluation (precision: 0.58-0.82, recall: 0.57-0.72). In contrast, when trained and internally evaluated on the combinatory datasets, all CNNs performed well across all metrics (precision: 0.94-1.00, recall: 0.77-1.00). However, when subsequently evaluated cross-dataset, results dropped substantially (precision: 0.10-0.61, recall: 0.04-0.80). For all datasets, saliency maps revealed that the CNNs rarely focused on areas inside the lungs for their decision making. Moreover, even when all pixel values outside the lungs were set to zero, classification performance did not change and dataset bias remained.
Conclusions: The results of this study confirm that, when trained on a combinatory dataset, CNNs tend to learn the origin of the CXRs rather than the presence or absence of disease, a behavior known as shortcut learning. The bias is shown to originate from differences in overall pixel values rather than embedded text or symbols, despite consistent image pre-processing. When trained on a reliable and realistic single-source dataset in which non-lung pixels have been masked, CNNs currently show limited sensitivity (<70%) for COVID-19 infection in CXR, calling into question their use as a reliable automatic screening tool.
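The results above are reported as precision and recall, while the conclusions refer to sensitivity; for the positive class, recall and sensitivity are the same quantity. The following minimal sketch illustrates how both metrics are computed for a binary COVID-19 positive/negative classifier. The function name `precision_recall` and the label vectors are hypothetical and chosen for illustration only; they are not taken from the study.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class.

    precision = TP / (TP + FP): fraction of predicted positives that are correct.
    recall    = TP / (TP + FN): fraction of true positives that are detected
                (identical to sensitivity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical test set: 4 COVID-19 positive and 6 negative CXRs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: 2 true positives, 2 missed cases, 1 false alarm.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In a cross-dataset setting, the same function would simply be applied to predictions on a test set drawn from a different source than the training data, which is where the reported drop in both metrics appears.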

Generalisation challenges in deep learning models for medical imagery: insights from external validation of COVID-19 classifiers

Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset

Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability

Systematic investigation into generalization of COVID-19 CT deep learning models with Gabor ensemble for lung involvement scoring

Deep Learning-based Multi-Class COVID-19 Classification with X-ray Images

A retrospective study of deep learning generalization across two centers and multiple models of X-ray devices using COVID-19 chest-X rays

A Generalizable Artificial Intelligence Model for COVID-19 Classification Task Using Chest X-ray Radiographs: Evaluated Over Four Clinical Datasets with 15,097 Patients

Automatic coronavirus disease 2019 diagnosis based on chest radiography and deep learning - Success story or dataset bias?

Deep learning-based Covid-19 diagnosis: a thorough assessment with a focus on generalization capabilities

Evaluation of Contemporary Convolutional Neural Network Architectures for Detecting COVID-19 from Chest Radiographs

Detection of Severe Lung Infection on Chest Radiographs of COVID-19 Patients: Robustness of AI Models across Multi-Institutional Data

Leveraging deep transfer learning and explainable AI for accurate COVID-19 diagnosis: Insights from a multi-national chest CT scan study

Image enhancement techniques on deep learning approaches for automated diagnosis of COVID-19 features using CXR images

Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images

Integrated ensemble CNN and explainable AI for COVID-19 diagnosis from CT scan and X-ray images

Virtual imaging trials improved the transparency and reliability of AI systems in COVID-19 imaging

Fine-Tuning Convolutional Neural Networks for COVID-19 Detection from Chest X-ray Images

Covid-19 Imaging Tools: How Big Data is Big?

Optimal hyperparameter selection of deep learning models for COVID-19 chest X-ray classification

Generalization in medical AI: a perspective on developing scalable models

Generalizable disease detection using model ensemble on chest X-ray images