Speech-Based Estimation of Schizophrenia Severity Using Feature Fusion

Gowtham Premananth, Carol Espy-Wilson
2024-11-09
Abstract: Speech-based assessment of the schizophrenia spectrum has been widely researched in the recent past. In this study, we develop a deep learning framework to estimate schizophrenia severity scores from speech using a feature fusion approach that fuses articulatory features with different self-supervised speech features extracted from pre-trained audio models. We also propose an auto-encoder-based self-supervised representation learning framework to extract compact articulatory embeddings from speech. Our top-performing speech-based fusion model with Multi-Head Attention (MHA) reduces Mean Absolute Error (MAE) by 9.18% and Root Mean Squared Error (RMSE) by 9.36% for schizophrenia severity estimation when compared with the previous models that combined speech and video inputs.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of estimating schizophrenia severity from speech alone. The researchers develop a deep-learning framework that estimates schizophrenia severity scores through feature fusion, combining articulatory features with different self-supervised speech features extracted from pre-trained audio models.

### Main Problems and Solutions

1. **Problem Background**
   - Schizophrenia is a chronic and serious mental health disorder, affecting approximately 24 million people worldwide.
   - Clinicians use assessment scales, such as the Brief Psychiatric Rating Scale (BPRS), to measure the severity of patients' symptoms.
   - Because the diverse symptoms of schizophrenia include changes in speech and language, speech has become a potential biomarker for detecting and assessing the disorder.
2. **Limitations of Existing Methods**
   - Early studies mainly relied on traditional features such as statistical acoustic features, MFCCs (Mel-Frequency Cepstral Coefficients), and spectrograms.
   - More recently, self-supervised representation-learning models such as Wav2Vec2 and WavLM, which are pre-trained on large-scale corpora, have made it possible to extract more generalizable speech representations.
   - Existing methods nevertheless remain limited, especially when multimodal data are difficult to obtain.
3. **New Method Proposed in the Paper**
   - **Self-supervised Articulatory Representation-Learning Model**: Generates compact articulatory representations from vocal tract variables (TVs) using a VQ-VAE (Vector-Quantized Variational Auto-Encoder).
   - **Feature-Fusion Model**: Combines self-supervised speech representations with the compact articulatory representations, using Multi-Head Attention (MHA) for feature fusion, to estimate schizophrenia severity more accurately.
4. **Improvement Effects**
   - Compared with previous models that combine speech and video inputs, the speech-based fusion model proposed in this study reduces the Mean Absolute Error (MAE) by 9.18% and the Root Mean Squared Error (RMSE) by 9.36%.

### Mathematical Formulas

- **Delay-Correlation Calculation**

  \[ r_d^{\text{TV1,TV2}}=\frac{\sum_{t = 0}^{N - d - 1}\text{TV1}[t]\cdot\text{TV2}[t + d]}{N-|d|} \]

  where \( N \) is the total number of input frames and \( d \) is the delay in frames.

- **Total Loss Function**

  \[ L_{\text{Total}}=L_{\text{Reconstruction}}+\beta\cdot L_{\text{Commitment}}+L_{\text{Codebook}} \]

  where \( \beta \) is a hyperparameter that controls how strongly the encoder output is committed to the codebook embedding.

- **Spearman Rank-Correlation Coefficient**

  \[ \rho = 1-\frac{6\sum d_i^2}{n(n^2 - 1)} \]

  where \( d_i \) is the difference between the two ranks of observation \( i \), and \( n \) is the sample size.

With these methods, the paper aims to improve the accuracy of schizophrenia severity estimation when only speech data are available, and it demonstrates the advantages of feature fusion and self-supervised learning for this task.
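As a rough illustration of the delay-correlation formula above, the following pure-Python sketch evaluates \( r_d \) for two short, made-up tract-variable tracks. The function name `delayed_correlation` and the sample values are assumptions for illustration, not taken from the paper, and the sketch only handles non-negative delays \( 0 \le d < N \).

```python
def delayed_correlation(tv1, tv2, d):
    """Evaluate sum_{t=0}^{N-d-1} tv1[t] * tv2[t+d] / (N - |d|).

    Hypothetical helper: only non-negative delays 0 <= d < N are handled.
    """
    n = len(tv1)
    if len(tv2) != n:
        raise ValueError("both tracks must have the same number of frames")
    if not 0 <= d < n:
        raise ValueError("this sketch only handles delays 0 <= d < N")
    # Sum of products between tv1 and tv2 shifted by d frames.
    total = sum(tv1[t] * tv2[t + d] for t in range(n - d))
    # Normalise by the number of overlapping frames.
    return total / (n - abs(d))


# Made-up tract-variable tracks with N = 4 frames.
tv1 = [1.0, 2.0, 3.0, 4.0]
tv2 = [2.0, 0.0, 1.0, 3.0]

r0 = delayed_correlation(tv1, tv2, 0)  # (2 + 0 + 3 + 12) / 4
r1 = delayed_correlation(tv1, tv2, 1)  # (0 + 2 + 9) / 3
print(r0, r1)
```

Dividing by \( N - |d| \) rather than \( N \) keeps the estimate comparable across delays, since fewer frames overlap as the delay grows.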
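The total loss above can be made concrete with a small forward-pass sketch. This is not the paper's implementation: the helper names, the toy codebook, and the value `beta=0.25` are all assumptions chosen for illustration, and mean squared error is assumed for every term. In training, the commitment and codebook terms differ only in where gradients are stopped (encoder versus codebook), so in a plain forward pass like this one they are numerically equal.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mse(a, b):
    return squared_distance(a, b) / len(a)


def nearest_code(z_e, codebook):
    """Index of the codebook vector closest to the encoder output z_e."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(z_e, codebook[k]))


def vq_total_loss(x, x_hat, z_e, codebook, beta=0.25):
    """Forward-pass value of L_rec + beta * L_commit + L_codebook.

    beta=0.25 is an illustrative default, not a value from the paper.
    """
    idx = nearest_code(z_e, codebook)
    z_q = codebook[idx]
    l_rec = mse(x, x_hat)
    # In training, gradients of this term would reach the encoder only.
    l_commit = mse(z_e, z_q)
    # In training, gradients of this term would reach the codebook only.
    l_codebook = mse(z_q, z_e)
    return l_rec + beta * l_commit + l_codebook, idx


# Toy values: a 2-dim input, its reconstruction, and a 3-entry codebook.
x, x_hat = [1.0, 2.0], [1.0, 4.0]
z_e = [0.0, 1.5]
codebook = [[0.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

total, idx = vq_total_loss(x, x_hat, z_e, codebook)
print(total, idx)
```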
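The Spearman formula above can be checked by hand on a small example. The sketch below uses made-up severity scores (not data from the paper) and a hypothetical `spearman_rho` helper. The closed form shown applies only when there are no tied values, so the helper rejects ties rather than silently returning a wrong result.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("the closed form used here assumes no tied values")
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up predicted and clinician-assigned severity scores.
predicted = [30, 42, 35, 50, 38]
clinician = [28, 45, 40, 48, 33]

rho = spearman_rho(predicted, clinician)
print(rho)  # rank differences are 0, 0, -1, 0, 1, so rho = 1 - 12/120
```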
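For readers unfamiliar with the reported metrics, here is a minimal sketch of MAE, RMSE, and a relative percentage reduction. All numbers are made up for illustration and are not results from the paper. The `relative_reduction` helper assumes the reported 9.18% and 9.36% are reductions relative to the baseline error, which the summary above does not state explicitly.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
    )


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline` (assumed definition)."""
    return 100.0 * (baseline - new) / baseline


# Made-up severity scores, for illustration only.
y_true = [30.0, 40.0, 50.0, 60.0]
y_pred = [32.0, 37.0, 50.0, 65.0]

print(mae(y_true, y_pred))             # absolute errors 2, 3, 0, 5
print(rmse(y_true, y_pred))            # sqrt((4 + 9 + 0 + 25) / 4)
print(relative_reduction(10.0, 9.0))   # hypothetical baseline vs. new error
```

RMSE penalises large individual errors more heavily than MAE, which is why the two are usually reported together.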