Abstract: Ocular diseases can severely impair visual acuity if left untreated, so early and accurate diagnosis is essential to preserving patients' quality of life. Fundus screening is a cost-effective clinical method for detecting ocular abnormalities, but it is time-intensive because resources and expert ophthalmologists are limited. Computer-aided detection, including traditional machine learning and deep learning, has been applied to improve diagnosis from fundus images, yet conventional deep learning models have limited global modeling ability, which induces bias and suboptimal performance on unbalanced datasets. Most current studies on ocular disease detection focus on cataract detection or diabetic retinopathy severity prediction, leaving many vision-impairing conditions unexplored. Few studies have used deep models to identify diverse ocular abnormalities from fundus images, and those have had limited success. This study uses four Swin Transformer models (Swin-T, Swin-S, Swin-B, and Swin-L) to detect several significant ocular diseases (including cataracts, hypertensive retinopathy, diabetic retinopathy, myopia, and age-related macular degeneration) from fundus images of the ODIR dataset. Swin Transformer models, which confine self-attention to local windows while enabling cross-window interactions, demonstrated superior performance and computational efficiency. On three ODIR test sets, evaluated with AUC, F1-score, Kappa score, and a composite "final score" defined as the average of these three, all Swin models scored higher than the models reported in contemporary studies. The Swin-L model, in particular, achieved final scores of 0.8501, 0.8211, and 0.8616 on the Off-site, On-site, and Balanced ODIR test sets, respectively. External validation on a Retina dataset further supported the generalizability of the Swin models, which reported final scores of 0.9058 (Swin-T), 0.92907 (Swin-S), 0.95917 (Swin-B), and 0.97042 (Swin-L). The results, corroborated by statistical analysis, indicate consistent and stable performance across varied datasets and point to the potential of Swin models as reliable tools for multi-ocular disease detection from fundus images, aiding early diagnosis and intervention for ocular abnormalities.
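
To make the composite final score concrete, the following minimal Python sketch averages AUC, F1-score, and Cohen's kappa for binary labels. The function names, the 0.5 decision threshold, the choice of Cohen's kappa, and the treatment of labels as one flat binary list are assumptions made for illustration; the abstract does not specify how the study computed each component.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_pos_rate = sum(y_true) / n
    pred_pos_rate = sum(y_pred) / n
    expected = true_pos_rate * pred_pos_rate + (1 - true_pos_rate) * (1 - pred_pos_rate)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def final_score(y_true, scores, threshold=0.5):
    """Average of AUC, F1, and kappa; threshold is an assumed cut-off for hard labels."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    auc = roc_auc(y_true, scores)
    f1 = f1_score(y_true, y_pred)
    kappa = cohen_kappa(y_true, y_pred)
    return (auc + f1 + kappa) / 3.0
```

Because kappa is zero for chance-level agreement and can be negative, the final score penalizes a model that does well on AUC alone but produces poor thresholded predictions.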

SwinSight: a hierarchical vision transformer using shifted windows to leverage aerial image classification

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Towards improved fundus disease detection using Swin Transformers

SwinVid: Enhancing Video Object Detection Using Swin Transformer

SwinHCST: a deep learning network architecture for scene classification of remote sensing images based on improved CNN and Transformer

SPT-Swin: A Shifted Patch Tokenization Swin Transformer for Image Classification

Swin Transformer for Robust Differentiation of Real and Synthetic Images: Intra- and Inter-Dataset Analysis

Spectral Swin Transformer Network for Hyperspectral Image Classification

HEAL-SWIN: A Vision Transformer On The Sphere

Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery

SwinSOD: Salient object detection using swin-transformer

Shifted Windows Transformers for Medical Image Quality Assessment

SWIN transformer based contrastive self-supervised learning for animal detection and classification

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction

SWFormer: Stochastic Windows Convolutional Transformer for Hybrid Modality Hyperspectral Classification

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

Vision Transformer with Sparse Scan Prior

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection