Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program

Paisan Ruamviboonsuk, Jonathan Krause, Peranut Chotcomwongse, Rory Sayres, Rajiv Raman, Kasumi Widner, Bilson J L Campana, Sonia Phene, Kornwipa Hemarat, Mongkol Tadarati, Sukhum Silpa-Archa, Jirawut Limwattanayingyong, Chetan Rao, Oscar Kuruvilla, Jesse Jung, Jeffrey Tan, Surapong Orprayoon, Chawawat Kangwanwongpaisan, Ramase Sukumalpaiboon, Chainarong Luengchaichawang, Jitumporn Fuangkaew, Pipat Kongsap, Lamyong Chualinpha, Sarawuth Saree, Srirut Kawinpanitan, Korntip Mitvongsa, Siriporn Lawanasakol, Chaiyasit Thepchatri, Lalita Wongpichedchai, Greg S Corrado, Lily Peng, Dale R Webster
DOI: https://doi.org/10.1038/s41746-019-0099-8
2019-04-10
Abstract: Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate NPDR or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001), and a slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, PDR, and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels by the algorithm and human graders was 0.85 and 0.78 respectively (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
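The metrics quoted in the abstract can be related to each other with a short sketch. It assumes that the reported 23% and 2% are absolute differences in false negative and false positive rate, which is what the stated sensitivities and specificities imply. The function names and the toy grade lists are hypothetical illustrations, not the authors' code or study data.

```python
def error_rates(sensitivity, specificity):
    """Return (false negative rate, false positive rate)."""
    return 1.0 - sensitivity, 1.0 - specificity


def quadratic_weighted_kappa(grades_a, grades_b, n_levels):
    """Cohen's kappa with quadratic weights.

    grades_a and grades_b are equal-length lists of integer grades
    in the range [0, n_levels). Undefined (division by zero) when both
    raters put every item in the same single level.
    """
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("grade lists must be non-empty and of equal length")
    n = len(grades_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(grades_a, grades_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_levels))
                  for j in range(n_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Disagreement penalty grows with the squared grade distance.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Figures from the abstract for referable DR.
algo_fnr, algo_fpr = error_rates(sensitivity=0.97, specificity=0.96)
human_fnr, human_fpr = error_rates(sensitivity=0.74, specificity=0.98)
fnr_reduction = human_fnr - algo_fnr   # about 0.23
fpr_increase = algo_fpr - human_fpr    # about 0.02

# Toy example on a five-level DR severity scale (made-up grades).
reference = [0, 1, 2, 3, 4]
kappa_same = quadratic_weighted_kappa(reference, reference, n_levels=5)
kappa_reversed = quadratic_weighted_kappa(reference, reference[::-1], n_levels=5)
```

Quadratic weighting penalises a two-level disagreement four times as heavily as a one-level disagreement, which is why it suits an ordinal scale such as DR severity.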