1078 Performance of USleep Algorithm to a Better Than “Gold-Standard” Polysomnogram Validation Data Set

Umaer Hanif, Guillaume Jubien, Alyssa Cairns, Tammie Radke, Vincent Mysliwiec
DOI: https://doi.org/10.1093/sleep/zsae067.01078
IF: 6.313
2024-04-20
SLEEP
Abstract:

Introduction: Algorithm-based polysomnogram (PSG) scoring is increasingly used in clinical practice and research. Currently, there is no accepted PSG validation data set to ensure algorithms are developed from an accepted standard; thus, algorithm-based scoring has inherent inaccuracies. The purpose of this study was to evaluate the inter-rater agreement between three experienced sleep technicians in order to develop a PSG validation data set with high interscorer reliability, and to compare the performance of the USleep algorithm against that data set.

Methods: One hundred de-identified PSGs (55 males, 45 females) from a clinical database were independently scored by three different technicians, and each record was quality controlled by a lead technician. The data set encompassed all sleep-related events: sleep stages, apneas, hypopneas, desaturations, arousals, and periodic limb movements (PLMs). Consensus annotations were computed when at least two technicians agreed upon a given event. The performance of each technician was compared against the consensus annotations. USleep was evaluated on the validation data set, and its sleep staging performance was compared against the consensus annotations.

Results: The inter-rater agreement was 96.0% for sleep stages across all epochs and did not greatly differ between N1 (88.3%), N2 (97.3%), N3 (94.2%), and REM sleep (98.1%). The inter-rater agreement across all records was 88.9% for arousals, 84.0% for obstructive apneas, 80.2% for central apneas, 76.4% for mixed apneas, 89.4% for hypopneas, 94.9% for desaturations, and 81.3% for PLMs. Comparing USleep to the consensus yielded an accuracy of 78.3%, with lower accuracy for N1 (46.2%) and N2 (76.8%) but not for N3 (99.3%) or REM sleep (92.1%).

Conclusion: This data set has high accuracy for sleep stages. The USleep algorithm showed poor performance for N1 and moderate performance for N2, but high performance for N3 and REM. Frequently scored PSG measures, including arousals, hypopneas, and obstructive apneas, had high degrees of accuracy; however, accuracy decreased for central and mixed apneas. This data set can serve as a benchmark for developing and validating algorithms in PSG scoring for sleep stages and certain PSG measures. Further development is required to provide a basis for a comprehensive PSG validation data set.
neurosciences, clinical neurology