Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging and joint modeling of multiple discrete and continuous clinical scores presents a promising new paradigm for multi-task problems in ophthalmology. The bi-channel framework that arises from the ophthalmic phenomenon of "interocular asymmetries" between both eyes (OU) calls for new ways of employing SOTA transformer-based models. However, applying copula models to multiple mixed discrete-continuous labels in deep learning (DL) is challenging. Moreover, applying advanced large transformer-based models to small medical datasets is difficult because of overfitting and computational resource constraints. To resolve these challenges, we propose OU-CoViT: novel Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF images, which can i) incorporate conditional correlation information across multiple discrete and continuous labels within a deep learning framework (by deriving the closed form of a novel Copula Loss); ii) take OU inputs subject to both high correlation and interocular asymmetries using a bi-channel model with dual adaptation; and iii) enable the adaptation of large vision transformer (ViT) models to small medical datasets. Experiments demonstrate that OU-CoViT significantly improves prediction performance compared to single-channel baseline models with empirical loss. Furthermore, the architecture of OU-CoViT allows our dual adaptation and Copula Loss to be generalized and extended to various ViT variants and large DL models on small medical datasets. Our approach opens up new possibilities for joint modeling of heterogeneous multi-channel input and mixed discrete-continuous clinical scores in medical practice and has the potential to advance AI-assisted clinical decision-making in medical domains beyond ophthalmology.

What problem does this paper attempt to address?

This paper attempts to solve three main problems:

1. **Modeling of mixed discrete-continuous labels in multi-task learning**: In ophthalmology applications, it is important to predict both discrete and continuous clinical scores (such as binary high-myopia status and axial length). However, most existing methods predict a single label and ignore the inherent high correlation between these labels. The paper therefore proposes a new copula-enhanced loss function that captures the conditional dependence structure among multiple discrete and continuous labels.
2. **Modeling of interocular asymmetry in both-eye images**: Existing studies rarely consider the "interocular asymmetry" in both-eye (OU) images, that is, the asymmetric features between the left-eye and right-eye images. Because of this asymmetry, the two images of the same patient may carry inconsistent information about myopia status. The paper introduces a bi-channel model that retains the features common to both eyes while learning the eye-specific information independently.
3. **Application of large Transformer models to small medical datasets**: Because medical images are difficult and costly to obtain and annotate, large Vision Transformers (ViT) and their variants face overfitting and computational resource constraints when applied to small medical datasets. The paper adopts transfer learning techniques such as low-rank adaptation (LoRA), which allow pre-trained large models to be fine-tuned on small datasets with less overfitting and a lower computational burden.

To solve these problems, the paper proposes a new framework named **OU-CoViT**: a copula-enhanced bi-channel multi-task Vision Transformer with a dual-adaptation mechanism.
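To make the copula idea from the first problem concrete, here is a minimal Python sketch of a Gaussian-copula negative log-likelihood for a single pair of labels: one continuous (for example axial length) and one binary (for example high-myopia status). It is a simplified two-dimensional illustration, not the paper's 4-dimensional closed-form Copula Loss. It assumes a Gaussian margin for the continuous label and a binary label obtained by thresholding a latent standard normal; the function name `mixed_copula_nll` and all of its parameters are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def mixed_copula_nll(y_cont, y_bin, mu, sigma, p, rho):
    """Negative log-likelihood of one (continuous, binary) label pair.

    Assumptions (illustrative, not taken from the paper):
      * y_cont ~ N(mu, sigma^2)
      * y_bin = 1 iff a latent standard normal exceeds -Phi^{-1}(p),
        so that marginally P(y_bin = 1) = p
      * rho is the Gaussian-copula correlation between the standardized
        residual of y_cont and the latent variable, with |rho| < 1
    """
    z = (y_cont - mu) / sigma              # standardized regression residual
    q = _STD_NORMAL.inv_cdf(p)             # probit of the marginal probability
    # Conditional probability of the positive class given the residual.
    cond_p = _STD_NORMAL.cdf((q + rho * z) / math.sqrt(1.0 - rho ** 2))
    cond_p = min(max(cond_p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
    # Gaussian NLL for the continuous label.
    nll_cont = 0.5 * math.log(2.0 * math.pi * sigma ** 2) + 0.5 * z ** 2
    # Bernoulli NLL for the binary label, conditioned on the residual.
    nll_bin = -(y_bin * math.log(cond_p) + (1 - y_bin) * math.log(1.0 - cond_p))
    return nll_cont + nll_bin
```

With `rho = 0` the expression reduces to a Gaussian regression loss plus an ordinary binary cross-entropy, which corresponds to treating the two labels independently. A nonzero `rho` lets the regression residual shift the predicted class probability.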
Specifically, OU-CoViT contains three key innovations:

1. **Copula loss for 4-dimensional mixed classification-regression tasks**: By deriving a closed-form expression of the joint density, the paper designs a computationally feasible copula loss that captures the conditional dependence structure between labels.
2. **Novel bi-channel architecture**: The architecture combines a dual-adaptation mechanism with a shared backbone network, so it can handle both the heterogeneity and the high correlation in multi-channel inputs.
3. **Efficient ViT application**: Low-rank adaptation (LoRA) makes large Transformer variants practical on small medical datasets, and the resulting approach is more efficient and easier to implement.

The paper evaluates OU-CoViT on an ultra-widefield (UWF) fundus dataset, reports improved prediction performance over single-channel baselines, and points to its potential in multi-task learning and AI-assisted clinical decision-making.
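One possible reading of the shared backbone with per-eye adaptation is sketched below in plain Python: a single frozen weight matrix is shared by both eyes, and each eye gets its own low-rank LoRA update. This is an assumption made for illustration and not the paper's exact architecture. The class name `BiChannelLoRALinear`, the eye keys `"OD"` (right eye) and `"OS"` (left eye), and the hyperparameters `rank` and `alpha` are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class BiChannelLoRALinear:
    """One linear layer: frozen shared weight W plus a LoRA adapter per eye.

    Output for eye e is W x + (alpha / rank) * B_e (A_e x).
    Only A_e and B_e would be trained; W stays at its pre-trained values.
    """

    EYES = ("OD", "OS")  # right eye, left eye

    def __init__(self, d_in, d_out, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pre-trained weight; frozen and shared by both eyes.
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.scale = alpha / rank
        # Down-projection A gets a small random init, up-projection B starts
        # at zero, so each adapter initially leaves the backbone output unchanged.
        self.A = {e: [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
                  for e in self.EYES}
        self.B = {e: [[0.0] * rank for _ in range(d_out)] for e in self.EYES}

    def forward(self, x, eye):
        base = matvec(self.W, x)                              # shared features
        delta = matvec(self.B[eye], matvec(self.A[eye], x))   # eye-specific update
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count the adapter parameters across both eyes."""
        return sum(len(m) * len(m[0])
                   for e in self.EYES for m in (self.A[e], self.B[e]))


layer = BiChannelLoRALinear(d_in=16, d_out=16, rank=2)
x = [1.0] * 16
out_right = layer.forward(x, "OD")
out_left = layer.forward(x, "OS")
```

In this sketch the frozen matrix carries what is common to both eyes, while the two small adapters can diverge during fine-tuning to absorb eye-specific differences. Because only the adapters would be updated, the number of trainable parameters grows with `rank` and not with the full size of the backbone.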

OU-CoViT: Copula-Enhanced Bi-Channel Multi-Task Vision Transformers with Dual Adaptation for OU-UWF Images

OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

MIL-ViT: A Multiple Instance Vision Transformer for Fundus Image Classification

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

Model long-range dependencies for multi-modality and multi-view retinopathy diagnosis through transformers

CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction

CA-ViT: Contour-Guided and Augmented Vision Transformers to Enhance Glaucoma Classification Using Fundus Images

Multi-label classification of retinal disease via a novel vision transformer model

Unified Visual Transformer Compression

Deep Relation Transformer for Diagnosing Glaucoma With Optical Coherence Tomography and Visual Field Function

MedViT: A robust vision transformer for generalized medical image classification

Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Ophthalmic Biomarker Detection with Parallel Prediction of Transformer and Convolutional Architecture

MMMViT: Multiscale multimodal vision transformer for brain tumor segmentation with missing modalities

Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework

Ophthalmic Biomarker Detection Using Ensembled Vision Transformers and Knowledge Distillation

Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract

Leveraging Multimodal Fusion for Enhanced Diagnosis of Multiple Retinal Diseases in Ultra-wide OCTA

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

Harnessing the power of longitudinal medical imaging for eye disease prognosis using Transformer-based sequence modeling