S4Sleep: Elucidating the design space of deep-learning-based sleep stage classification models

Tiezhi Wang, Nils Strodthoff
2024-08-21
Abstract: Scoring sleep stages in polysomnography recordings is a time-consuming task plagued by significant inter-rater variability. Therefore, it stands to benefit from the application of machine learning algorithms. While many algorithms have been proposed for this purpose, certain critical architectural decisions have not received systematic exploration. In this study, we meticulously investigate these design choices within the broad category of encoder-predictor architectures. We identify robust architectures applicable to both time series and spectrogram input representations. These architectures incorporate structured state space models as integral components and achieve statistically significant performance improvements compared to state-of-the-art approaches on the extensive Sleep Heart Health Study dataset. We anticipate that the architectural insights gained from this study along with the refined methodology for architecture search demonstrated herein will not only prove valuable for future research in sleep staging but also hold relevance for other time series annotation tasks.
Machine Learning, Signal Processing
What problem does this paper attempt to address?
This paper addresses the fact that several key design decisions in automatic sleep staging have not been systematically explored. Specifically, it focuses on the following aspects:

1. **Choice of input representation**: There is currently no consensus on the most suitable input representation. State-of-the-art models usually rely on spectrograms, but spectrograms do not preserve the full complexity of the input signal the way the raw time series does. Although spectrograms encode useful inductive biases, competing models that operate on the raw time series and are trained on large-scale datasets may eventually outperform spectrogram-based models because they can learn more subtle representations.
2. **Architecture design choices**: Even for a given input representation, architecture design choices have rarely been studied systematically. Besides the commonly used recurrent neural networks (RNNs) and the more recent Transformer models, such a study should also include structured state space sequence (S4) models, which are known for their ability to capture long-range dependencies in time series data and have been applied successfully to other physiological time series, such as electrocardiogram data.

### Main contributions of the paper

1. **Systematic procedure**: Using sleep staging as an example, the authors design a systematic procedure for identifying the optimal model architecture for long time series labeling tasks. The search covers encoder-predictor architectures, which include most methods in the literature.
2. **Optimal model architecture**: Based on this procedure, optimal model architectures are identified for raw time series and spectrogram inputs, in both single-channel and multi-channel configurations. These models outperform existing methods on both the Sleep-EDF dataset and the large-scale Sleep Heart Health Study (SHHS) dataset without further hyperparameter tuning, which supports the robustness of the findings.
   - **Raw time series input**: The structured state space model (S4) is a very effective sequence encoder; its input is the time series representation encoded by a shallow CNN.
   - **Spectrogram input**: S4 is likewise used as the sequence encoder, but it acts directly on the spectrogram. The optimal predictor is a Transformer model.
3. **Prediction head**: To keep the scope of the study limited, only prediction heads consisting of a global or local average pooling layer followed by a linear layer (with five output neurons) are explored.

### Method overview

- **Single-epoch prediction model**: Two designs are considered: placing a shallow CNN encoder in front of the predictor, or passing the input directly to the predictor. A shallow CNN is a common choice for processing raw waveforms and, in audio processing, spectrograms. It maps the original input to a representation better suited to the predictor at only a slight increase in complexity and, optionally (with strided convolutions), reduces the temporal resolution of the signal.
- **Multi-epoch prediction model**: Different strategies are considered, including using the optimal single-epoch architecture as the epoch encoder, and using a shallow CNN on raw time series input to provide a hidden representation with temporal downsampling.

### Dataset

- **Sleep-EDF**: Contains 197 sleep recordings from two subsets: Sleep Cassette (153 recordings) and Sleep Telemetry (44 recordings). All recordings include Fpz-Cz and Pz-Oz EEG channels as well as EOG channels, sampled at 100 Hz.
- **SHHS**: Contains 5,463 recordings from two overnight visits (Visit 1 and Visit 2). Each recording includes two EEG channels, two EOG channels, and one EMG channel, as well as other signals.

### Training and performance evaluation

- **Training**: Focal loss is used as the loss function to address the imbalanced label distribution. Training uses the AdamW optimizer with a fixed effective batch size of 64, achieved through gradient accumulation. The model is trained for 50 training epochs on Sleep-EDF (SEDF) and 30 on SHHS.
- **Performance evaluation**: The main target metric is the macro F1-score, calculated as the mean of the F1-scores of the individual labels. Additional metrics are also reported, such as the F1-scores of individual labels, the overall prediction accuracy, and the macro-averaged area under the receiver operating characteristic curve (ma
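
The loss function and the main target metric named under training and performance evaluation can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the focusing parameter `gamma=2.0` is an assumed value (the summary does not state the one used), and scoring an absent class as 0 is one convention among several.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class. gamma=2.0 is an
    assumption; with gamma=0 the loss reduces to plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores (five sleep stages by default)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes
```

Because the factor `(1 - p_t) ** gamma` shrinks the loss of confidently correct predictions, easy examples from frequent classes contribute less to the gradient, which is why focal loss suits an imbalanced label distribution. Macro averaging has a similar effect on the evaluation side: every label counts equally regardless of how often it occurs.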