Self-supervised Multimodal Speech Representations for the Assessment of Schizophrenia Symptoms

Gowtham Premananth, Carol Espy-Wilson
2024-11-03
Abstract: Multimodal schizophrenia assessment systems have gained traction over the last few years. This work introduces a schizophrenia assessment system to discern between prominent symptom classes of schizophrenia and predict an overall schizophrenia severity score. We develop a Vector Quantized Variational Auto-Encoder (VQ-VAE) based Multimodal Representation Learning (MRL) model to produce task-agnostic speech representations from vocal Tract Variables (TVs) and Facial Action Units (FAUs). These representations are then used in a Multi-Task Learning (MTL) based downstream prediction model to obtain class labels and an overall severity score. The proposed framework outperforms the previous works on the multi-class classification task across all evaluation metrics (Weighted F1 score, AUC-ROC score, and Weighted Accuracy). Additionally, it estimates the schizophrenia severity score, a task not addressed by earlier approaches.
Audio and Speech Processing, Sound, Signal Processing
What problem does this paper attempt to address?
This paper aims to develop a multimodal system that can assess schizophrenia symptoms and predict their severity. Specifically, the study aims to:

1. **Distinguish between types of schizophrenia symptoms**: Build a multimodal representation learning (MRL) model that extracts features from speech and facial expressions to distinguish the main symptom categories (such as strong positive symptoms, strong negative symptoms, and mixed symptoms).
2. **Predict the overall schizophrenia severity score**: Estimate an overall schizophrenia severity score for each subject, based on the Brief Psychiatric Rating Scale (BPRS).

### Research Background

Schizophrenia is a complex mental health disorder with a wide range of symptoms, including hallucinations, delusions, reduced emotional expression, and poverty of speech. These symptoms seriously affect the patient's daily life and social functioning. Traditional assessment relies on questionnaires and the subjective judgment of clinicians, which limits its consistency and can introduce errors. An automated, objective assessment tool is therefore of great value.

### Main Contributions

The main contributions of this study are:

1. **Self-supervised multimodal speech representation learning**: A self-supervised multimodal speech representation learning method based on vocal tract variables (TVs) and facial action units (FAUs) that generates task-agnostic speech representations.
2. **Multi-task learning (MTL) downstream model**: A downstream model that uses these multimodal representations within a multi-task learning framework to perform three-class classification (distinguishing symptom categories) and regression (estimating severity scores) simultaneously.
3. **A benchmark**: A baseline model for speech-based schizophrenia severity score estimation.
### Method Overview

- **Dataset**: A multimodal dataset of video and audio recordings, collected jointly by the University of Maryland School of Medicine and the University of Maryland, College Park.
- **Feature extraction**: Vocal tract variables (TVs) and facial action units (FAUs) are extracted from the segmented audio and video as low-level feature representations.
- **Model architecture**:
  - **Self-supervised multimodal representation learning**: A Vector Quantized Variational Auto-Encoder (VQ-VAE) model generates discrete multimodal latent space representations.
  - **Multi-task learning**: A fusion block combines the audio and video latent representations, which are then used for classification and regression in the downstream tasks.

### Experimental Results

The experimental results show that the proposed multimodal representation learning model outperforms previous work on the multi-class classification task and can also predict the schizophrenia severity score. This indicates that a properly trained multimodal representation learning model can generate better representations than task-specific models, and that the multi-task learning paradigm helps improve performance on both the classification and regression tasks.

### Conclusions and Future Work

This study shows that a multimodal representation learning framework based on vocal tract variables and facial action units is well suited to schizophrenia assessment. Future research will extend the framework to more data sources to improve the model's generalization, and will consider adding a text modality to make the multimodal information more complete.
### Formula Example

In multi-task learning, the total loss function \( L_{\text{total}} \) is calculated as follows:

\[ L_{\text{total}}=\frac{L_{\text{Classification}}}{2\sigma_1^2}+\log(\sigma_1)+\frac{L_{\text{Regression}}}{2\sigma_2^2}+\log(\sigma_2) \]

where \( L_{\text{Classification}} \) and \( L_{\text{Regression}} \) are the loss functions of the classification task and the regression task respectively, and \( \sigma_1 \) and \( \sigma_2 \) are two learnable parameters representing the uncertainty of each task.
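
The total loss formula can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `uncertainty_weighted_loss` and the numeric loss values are hypothetical, chosen only to show how the two uncertainty parameters rescale each task's loss. In a real training loop, \( \sigma_1 \) and \( \sigma_2 \) would be learnable parameters updated by the optimizer rather than fixed numbers.

```python
import math


def uncertainty_weighted_loss(loss_cls, loss_reg, sigma1, sigma2):
    """Combine two task losses using the uncertainty-weighted formula.

    L_total = L_cls / (2 * sigma1^2) + log(sigma1)
            + L_reg / (2 * sigma2^2) + log(sigma2)

    Hypothetical helper for illustration; sigma1 and sigma2 are plain
    floats here, whereas in training they would be learned.
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("sigma values must be positive")
    cls_term = loss_cls / (2.0 * sigma1 ** 2) + math.log(sigma1)
    reg_term = loss_reg / (2.0 * sigma2 ** 2) + math.log(sigma2)
    return cls_term + reg_term


# Illustrative loss values only (not taken from the paper).
total_equal = uncertainty_weighted_loss(0.8, 1.5, sigma1=1.0, sigma2=1.0)
# Raising sigma2 shrinks the regression term but pays a log(sigma2) penalty.
total_downweighted = uncertainty_weighted_loss(0.8, 1.5, sigma1=1.0, sigma2=2.0)

print(f"sigma2 = 1.0: {total_equal:.4f}")
print(f"sigma2 = 2.0: {total_downweighted:.4f}")
```

The \( \log(\sigma) \) terms act as a regularizer: without them, the model could drive the total loss toward zero simply by making both \( \sigma \) values arbitrarily large. For a fixed task loss \( L \), the term \( \frac{L}{2\sigma^2}+\log(\sigma) \) is minimized at \( \sigma^2 = L \), so a task with a larger loss is assigned a larger uncertainty and a smaller weight.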