Abstract: Speech-based assessment of the schizophrenia spectrum has been widely researched in the recent past. In this study, we develop a deep learning framework to estimate schizophrenia severity scores from speech using a feature fusion approach that fuses articulatory features with different self-supervised speech features extracted from pre-trained audio models. We also propose an auto-encoder-based self-supervised representation learning framework to extract compact articulatory embeddings from speech. Our top-performing speech-based fusion model with Multi-Head Attention (MHA) reduces Mean Absolute Error (MAE) by 9.18% and Root Mean Squared Error (RMSE) by 9.36% for schizophrenia severity estimation when compared with the previous models that combined speech and video inputs.

What problem does this paper attempt to address?

This paper addresses the problem of estimating the severity of schizophrenia from speech. Specifically, the researchers develop a deep-learning framework that estimates the schizophrenia severity score with a feature-fusion method, combining articulatory features with different self-supervised speech features extracted from pre-trained audio models.

### Main Problems and Solutions

1. **Problem Background**
   - Schizophrenia is a chronic and serious mental health disorder, affecting approximately 24 million people worldwide.
   - Clinicians use assessment scales, such as the Brief Psychiatric Rating Scale (BPRS), to measure the severity of patients' symptoms.
   - Because schizophrenia presents with diverse symptoms, in particular changes in language expression, speech has become a potential biomarker for detecting and evaluating the disorder.

2. **Limitations of Existing Methods**
   - Early studies relied mainly on traditional features such as statistical acoustic features, MFCCs (Mel-Frequency Cepstral Coefficients), and spectrograms.
   - More recently, self-supervised representation-learning models such as Wav2Vec2 and WavLM, pre-trained on large-scale corpora, have made it possible to extract more general speech representations.
   - Nevertheless, existing methods still have limitations, especially when multi-modal data is difficult to obtain.

3. **New Method Proposed in the Paper**
   - **Self-supervised Articulatory Representation-Learning Model**: Generates compact articulatory representations from vocal tract variables (TVs) with a VQ-VAE (Vector-Quantized Variational Auto-Encoder).
   - **Feature-Fusion Model**: Combines self-supervised speech representations with the compact articulatory representations and fuses them with a multi-head attention (MHA) mechanism to estimate the severity of schizophrenia more accurately.

4. **Improvement Effects**
   - Compared with previous models that combine speech and video inputs, the speech-based fusion model proposed in this study reduces the Mean Absolute Error (MAE) by 9.18% and the Root Mean Squared Error (RMSE) by 9.36%.

### Mathematical Formulas

- **Delayed Correlation**
  \[ r_d^{\text{TV1,TV2}}=\frac{\sum_{t = 0}^{N - d - 1}\text{TV1}[t]\cdot\text{TV2}[t + d]}{N-|d|} \]
  where \( N \) is the total number of input frames, and \( d \) is the number of delay frames.

- **Total Loss Function**
  \[ L_{\text{Total}}=L_{\text{Reconstruction}}+\beta\cdot L_{\text{Commitment}}+L_{\text{Codebook}} \]
  where \( \beta \) is a hyperparameter that controls how strongly the encoder output commits to the codebook embedding.

- **Spearman Rank-Correlation Coefficient**
  \[ \rho = 1-\frac{6\sum d_i^2}{n(n^2 - 1)} \]
  where \( d_i \) is the rank difference of each observation, and \( n \) is the sample size.

Through these methods, the paper aims to improve the accuracy of schizophrenia severity estimation when only speech data is available, and demonstrates the advantages of feature fusion and self-supervised learning for this task.
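The three formulas listed in this summary can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the delayed correlation is written for non-negative delays only (matching the summation limits shown), the reconstruction term is assumed to be a mean squared error, `beta=0.25` is only a commonly used default, and the Spearman helper assumes there are no tied values.

```python
import numpy as np


def delayed_correlation(tv1, tv2, d):
    """Delayed correlation r_d between two tract-variable tracks, for 0 <= d < N."""
    tv1 = np.asarray(tv1, dtype=float)
    tv2 = np.asarray(tv2, dtype=float)
    n = len(tv1)
    if len(tv2) != n or not 0 <= d < n:
        raise ValueError("tracks must share a length N and satisfy 0 <= d < N")
    # Sum over t = 0 .. N-d-1 of TV1[t] * TV2[t+d], normalised by N - |d|.
    return float(np.dot(tv1[: n - d], tv2[d:]) / (n - abs(d)))


def vq_vae_total_loss(x, x_hat, z_e, codebook, beta=0.25):
    """Forward value of L_Reconstruction + beta * L_Commitment + L_Codebook.

    x, x_hat : input and reconstruction, same shape
    z_e      : encoder outputs, shape (T, D)
    codebook : codebook embeddings, shape (K, D)
    """
    x, x_hat, z_e, codebook = (
        np.asarray(a, dtype=float) for a in (x, x_hat, z_e, codebook)
    )
    # Quantise: pick the nearest codebook entry for every encoder output.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z_q = codebook[dists.argmin(axis=1)]

    reconstruction = np.mean((x - x_hat) ** 2)  # assumed to be MSE
    codebook_term = np.mean((z_q - z_e) ** 2)
    # In a trained model the commitment and codebook terms differ only in
    # where the stop-gradient is placed; their forward values are identical.
    commitment = codebook_term
    return float(reconstruction + beta * commitment + codebook_term)


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); no ties."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    # Double argsort yields 0-based ranks; the offset cancels in the difference.
    rank_diff = np.argsort(np.argsort(a)) - np.argsort(np.argsort(b))
    return float(1 - 6 * np.sum(rank_diff ** 2) / (n * (n ** 2 - 1)))


if __name__ == "__main__":
    track = [1.0, 2.0, 3.0, 4.0]
    print(delayed_correlation(track, track, d=1))
    print(spearman_rho([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))
```

For data with tied values, a library routine such as `scipy.stats.spearmanr` is a safer choice than the closed-form expression, which is exact only when all ranks are distinct.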

Speech-Based Estimation of Schizophrenia Severity Using Feature Fusion
