Bridging AI and Clinical Practice: Integrating Automated Sleep Scoring Algorithm with Uncertainty-Guided Physician Review
Michal Bechny, Giuliana Monachino, Luigi Fiorillo, Julia van der Meer, Markus Schmidt, Claudio Bassetti, Athina Tzovara, Francesca Faraci
DOI: https://doi.org/10.2147/nss.s455649
2024-05-28
Nature and Science of Sleep
Author affiliations: Michal Bechny (1, 2), Giuliana Monachino (1, 2), Luigi Fiorillo (2), Julia van der Meer (3), Markus H Schmidt (3, 4), Claudio LA Bassetti (3), Athina Tzovara (1, 3), Francesca D Faraci (2). (1) Institute of Computer Science, University of Bern, Bern, Switzerland; (2) Institute of Digital Technologies for Personalized Healthcare (Meditech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland; (3) Department of Neurology, University Hospital of Bern, Bern, Switzerland; (4) Ohio Sleep Medicine Institute, Dublin, OH, USA
Correspondence: Michal Bechny, Institute of Digital Technologies for Personalized Healthcare, East Campus USI-SUPSI, Via la Santa 1, CH-6962 Lugano-Viganello, Lugano, Switzerland, Tel +41 (0)58 666 65 10
Abstract:
Purpose: This study aims to enhance the clinical use of automated sleep-scoring algorithms by incorporating an uncertainty estimation approach that efficiently assists clinicians in the manual review of predicted hypnograms, a necessity due to the notable inter-scorer variability inherent in polysomnography (PSG) databases. We focus on the extent of review required to achieve predefined agreement levels, examining both in-domain (ID) and out-of-domain (OOD) data, and considering subjects' diagnoses.
Patients and Methods: A total of 19,578 PSGs from 13 open-access databases were used to train U-Sleep, a state-of-the-art sleep-scoring algorithm. We leveraged a comprehensive clinical database of an additional 8832 PSGs, covering a full spectrum of ages (0–91 years) and sleep disorders, to refine U-Sleep and to evaluate different uncertainty-quantification approaches, including our novel confidence network. The ID data consisted of PSGs scored by over 50 physicians, and the two OOD sets comprised recordings each scored by a unique senior physician.
Results: U-Sleep demonstrated robust performance, with Cohen's kappa (κ) at 76.2% on ID and 73.8–78.8% on OOD data.
The confidence network excelled at identifying uncertain predictions, achieving AUROC scores of 85.7% on ID and 82.5–85.6% on OOD data. Independently of sleep-disorder status, statistical evaluations revealed significant differences in confidence scores between predictions that agree with the physician's scoring and those that disagree, and significant correlations of confidence scores with classification performance metrics. To achieve κ ≥ 90% with physician intervention, review of less than 29.0% of uncertain epochs was required, substantially reducing physicians' workload and facilitating near-perfect agreement.
Conclusion: Inter-scorer variability limits the accuracy of the scoring algorithms to ~80%. By integrating uncertainty estimation with U-Sleep, we enhance the review of predicted hypnograms so that they align with the scoring taste of the responsible physician. Validated across ID and OOD data and various sleep disorders, our approach offers a strategy to boost the usability of automated scoring tools in clinical settings.
Keywords: automated sleep scoring, uncertainty quantification, explainable AI, polysomnography, sleep medicine
Sleep, often dubbed the third pillar of health alongside diet and exercise, plays a critical role in our well-being. Polysomnography (PSG), a comprehensive sleep monitoring technique, captures detailed biosignals, primarily the electroencephalogram (EEG), the electrooculogram (EOG), and the electromyogram (EMG). Adhering to the guidelines of the American Academy of Sleep Medicine (AASM) [1], physicians score PSG recordings into specific sleep stages on 30-second windows (epochs). Such structured scoring, called a hypnogram, divides sleep into five distinct stages: W, REM, N1, N2, and N3, each representing a unique physiological state [2]. The proportions of sleep stages, as well as patterns in their transitions, are basic indicators of sleep health [3,4] and also biomarkers of certain disorders [5–7]. While manual scoring remains the gold standard, the procedure is labor-intensive, often demanding up to 2 hours for a comprehensive evaluation of a single PSG recording [8]. Research into automatic sleep scoring, which aims to support physicians' manual scoring with computational algorithms, dates back to the 1960s [9]. Recent advancements in Artificial Intelligence (AI) have significantly improved automatic scoring solutions, especially those based on Machine and Deep Learning (ML/DL) methodologies. Notably, the U-Sleep algorithm introduced by Perslev et al [10] and further investigated by Fiorillo & Monachino et al [11] stands at the forefront due to its balance between performance rivaling human -Abstract Truncated-
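The uncertainty-guided review summarized in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cohen_kappa`, `review_lowest_confidence`, `min_review_fraction`) are hypothetical, the confidence values are simulated rather than produced by the confidence network, and the sketch assumes that a reviewed epoch simply takes the physician's label. It computes Cohen's kappa between two hypnograms and searches for the smallest share of least-confident epochs that must be reviewed to reach a target agreement.

```python
import random
from collections import Counter

STAGES = ["W", "N1", "N2", "N3", "REM"]  # the five sleep stages


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of stage labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from the two marginal label distributions.
    p_expected = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    if p_expected == 1.0:  # both sequences use one identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def review_lowest_confidence(predicted, confidence, physician, fraction):
    """Replace the `fraction` least-confident predictions with physician labels.

    Assumption: a reviewed epoch always ends up with the physician's label.
    """
    n_review = round(fraction * len(predicted))
    order = sorted(range(len(predicted)), key=lambda i: confidence[i])
    corrected = list(predicted)
    for i in order[:n_review]:
        corrected[i] = physician[i]
    return corrected


def min_review_fraction(predicted, confidence, physician, target_kappa, step=0.01):
    """Smallest review fraction on a grid of width `step` reaching the target kappa."""
    steps = round(1 / step)
    for s in range(steps + 1):
        fraction = s / steps
        corrected = review_lowest_confidence(predicted, confidence, physician, fraction)
        if cohen_kappa(corrected, physician) >= target_kappa:
            return fraction
    return 1.0


def main():
    # Synthetic night of 1000 epochs: a prediction matches the physician with
    # probability equal to its (simulated) confidence score.
    rng = random.Random(0)
    physician, predicted, confidence = [], [], []
    for _ in range(1000):
        truth = rng.choice(STAGES)
        c = rng.uniform(0.4, 1.0)
        if rng.random() < c:
            label = truth
        else:
            label = rng.choice([s for s in STAGES if s != truth])
        physician.append(truth)
        predicted.append(label)
        confidence.append(c)

    print("kappa before review:", round(cohen_kappa(predicted, physician), 3))
    fraction = min_review_fraction(predicted, confidence, physician, 0.90)
    print("fraction of epochs reviewed to reach kappa >= 0.90:", fraction)


if __name__ == "__main__":
    main()
```

Sorting epochs by confidence is what makes the review efficient in this sketch: the better the confidence score separates agreeing from disagreeing predictions, the fewer epochs a physician has to inspect before the target agreement is reached.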
neurosciences, clinical neurology