Artificial intelligence for diagnosing exudative age-related macular degeneration
Chaerim Kang, Jui-En Lo, Helen Zhang, Sueko M Ng, John C Lin, Ingrid U Scott, Jayashree Kalpathy-Cramer, Su-Hsun Alison Liu, Paul B Greenberg
DOI: https://doi.org/10.1002/14651858.cd015522.pub2
IF: 8.4
2024-10-18
Cochrane Database of Systematic Reviews
Abstract: Age‐related macular degeneration (AMD) is a retinal disorder characterized by central retinal (macular) damage. Approximately 10% to 20% of non‐exudative AMD cases progress to the exudative form, which may result in rapid deterioration of central vision. Individuals with exudative AMD (eAMD) need prompt consultation with retinal specialists to minimize the risk and extent of vision loss. Traditional methods of diagnosing ophthalmic disease rely on clinical evaluation and multiple imaging techniques, which can be resource‐intensive. Tests leveraging artificial intelligence (AI) hold the promise of automatically identifying and categorizing pathological features, enabling the timely diagnosis and treatment of eAMD.
The objective of this review was to determine the diagnostic accuracy of AI as a triaging tool for eAMD.
We searched CENTRAL, MEDLINE, Embase, three clinical trials registries, and Data Archiving and Networked Services (DANS) for gray literature. We did not restrict searches by language or publication date. The date of the last search was April 2024.
Included studies compared the test performance of algorithms with that of human readers in detecting eAMD on retinal images collected from people with AMD who were evaluated at eye clinics in community or academic medical centers, and who were not receiving treatment for eAMD when the images were taken. We included algorithms that were internally validated, externally validated, or both.
Pairs of review authors independently extracted data and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies‐2 (QUADAS‐2) tool with revised signaling questions. For studies that reported more than one set of performance results, we extracted only one set of diagnostic accuracy data per study, based on the last development stage or the optimal algorithm as indicated by the study authors.
For two‐class algorithms, we collected data from the 2x2 table whenever feasible. For multi‐class algorithms, we first consolidated data from all classes other than eAMD before constructing the corresponding 2x2 tables. Assuming that the included studies applied a common positivity threshold, we used bivariate random‐effects logistic models to estimate summary sensitivity and specificity as the primary performance metrics.
We identified 36 eligible studies that reported 40 sets of algorithm performance data, encompassing over 16,000 participants and 62,000 images. We included 28 studies (78%) that reported 31 algorithms with performance data in the meta‐analysis. The remaining eight studies (22%) reported nine algorithms that lacked usable performance data; we reported them in the qualitative synthesis.
Study characteristics and risk of bias
Most studies were conducted in Asia, followed by Europe, the USA, and collaborative efforts spanning multiple countries. Most studies identified study participants from the hospital setting, while others used retinal images from public repositories; a few studies did not specify image sources. Based on the four of the 36 studies that reported demographic information, the age of the study participants ranged from 62 to 82 years. The included algorithms used various retinal image types as model input, such as optical coherence tomography (OCT) images (N = 15), fundus images (N = 6), and multimodal imaging (N = 7). The predominant core method used was deep neural networks. All studies that reported externally validated algorithms were at high risk of bias, mainly due to potential selection bias from either a two‐gate design or the inappropriate exclusion of potentially eligible retinal images (or participants).
Findings
Only three of the 40 included algorithms were externally validated (7.5%, 3/40).
For these externally validated algorithms, the summary sensitivity and specificity were 0.94 (95% confidence interval (CI) 0.90 to 0.97) and 0.99 (95% CI 0.76 to 1.00), respectively, when compared to human graders (3 studies; 27,872 images; low‐certainty evidence). The prevalence of images with eAMD ranged from 0.3% to 49%.
Twenty‐eight algorithms were reportedly either internally validated (20%, 8/40) or tested on a development set (50%, 20/40); the pooled sensitivity and specificity were 0.93 (95% CI 0.89 to 0.96) and 0.96 (95% CI 0.94 to 0.98), respectively, when compared to human graders (28 studies; 33,409 images; low‐certainty evidence). We did not identify significant sources of heterogeneity among these 28 algorithms. Although algorithms using OCT images appeared more homogeneous and had the highest summary specificity (0.97, 95% CI 0.93 to 0.98), they were not superior to algorithms using fundus images alone (0.94, 95% CI 0.89 to 0.97) or multimodal imaging (0.96, 95% CI 0.88 to 0.99; P for meta‐regression = 0.239). The median prevalence of images with eAMD was 30% (interquartile range [IQR] 22% to 39%). We did not include eight studies that described nine algori -Abstract Truncated-
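The abstract describes consolidating all non‐eAMD classes of a multi‐class algorithm before building a 2x2 table. The following is a minimal sketch of that step for a single study, assuming a confusion matrix with reference labels in rows and algorithm outputs in columns. The function names, class labels, and counts are hypothetical and do not come from any included study. The sketch does not implement the bivariate random‐effects model used for pooling across studies.

```python
from typing import List, Sequence, Tuple


def collapse_to_2x2(
    confusion: Sequence[Sequence[int]],
    labels: List[str],
    target: str = "eAMD",
) -> Tuple[int, int, int, int]:
    """Collapse a multi-class confusion matrix into (tp, fp, fn, tn) for one class.

    confusion[i][j] is the number of images whose reference label is labels[i]
    and whose algorithm output is labels[j]. Every class other than `target`
    is merged into a single "not target" class.
    """
    k = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]                          # target images called target
    fn = sum(confusion[k]) - tp                   # target images called anything else
    fp = sum(row[k] for row in confusion) - tp    # non-target images called target
    tn = total - tp - fn - fp                     # non-target images called non-target
    return tp, fp, fn, tn


def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) computed from 2x2 counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Need at least one positive and one negative reference image.")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical three-class counts, for illustration only.
labels = ["normal", "non-exudative AMD", "eAMD"]
confusion = [
    [90, 8, 2],    # reference: normal
    [10, 80, 10],  # reference: non-exudative AMD
    [1, 4, 95],    # reference: eAMD
]

tp, fp, fn, tn = collapse_to_2x2(confusion, labels)
sens, spec = sensitivity_specificity(tp, fp, fn, tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this sketch, confusions among the non‐eAMD classes (for example, normal versus non‐exudative AMD) count as true negatives, because only the eAMD versus not‐eAMD distinction is kept.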
medicine, general & internal