Abstract: Multimodal schizophrenia assessment systems have gained traction over the last few years. This work introduces a schizophrenia assessment system to discern between prominent symptom classes of schizophrenia and predict an overall schizophrenia severity score. We develop a Vector Quantized Variational Auto-Encoder (VQ-VAE) based Multimodal Representation Learning (MRL) model to produce task-agnostic speech representations from vocal Tract Variables (TVs) and Facial Action Units (FAUs). These representations are then used in a Multi-Task Learning (MTL) based downstream prediction model to obtain class labels and an overall severity score. The proposed framework outperforms the previous works on the multi-class classification task across all evaluation metrics (Weighted F1 score, AUC-ROC score, and Weighted Accuracy). Additionally, it estimates the schizophrenia severity score, a task not addressed by earlier approaches.

What problem does this paper attempt to address?

This paper sets out to develop a multimodal system that evaluates schizophrenia symptoms and predicts their severity. Specifically, the study aims to:

1. **Distinguish different types of schizophrenia symptoms**: Build a multimodal representation learning (MRL) model that extracts features from speech and facial expressions to distinguish the main symptom categories (strong positive symptoms, strong negative symptoms, and mixed symptoms).
2. **Predict the overall schizophrenia severity score**: Estimate an overall severity score for each subject, based on the Brief Psychiatric Rating Scale (BPRS).

### Research Background

Schizophrenia is a complex mental health disorder with a wide range of symptoms, including hallucinations, delusions, reduced emotional expression, and poverty of speech. These symptoms seriously affect the patient's daily life and social functioning. Traditional assessment relies on questionnaires and the subjective judgment of clinicians, both of which are prone to limitations and error. An automated, objective assessment tool is therefore of great value.

### Main Contributions

The main contributions of this study are:

1. **Self-supervised multimodal speech representation learning**: A self-supervised method that uses vocal tract variables (TVs) and facial action units (FAUs) to generate task-agnostic speech representations.
2. **Multi-task learning (MTL) downstream model**: A downstream model that uses these multimodal representations within a multi-task learning framework to perform three-class classification (distinguishing symptom categories) and regression (estimating severity scores) simultaneously.
3. **A benchmark**: A baseline model for speech-based schizophrenia severity score estimation.
### Method Overview

- **Dataset**: A multimodal dataset of video and audio recordings, collected jointly by the University of Maryland School of Medicine and the University of Maryland, College Park.
- **Feature extraction**: Vocal tract variables (TVs) and facial action units (FAUs) are extracted from the segmented audio and video recordings as low-level feature representations.
- **Model architecture**:
  - **Self-supervised multimodal representation learning**: A Vector Quantized Variational Auto-Encoder (VQ-VAE) model generates discrete multimodal latent space representations.
  - **Multi-task learning**: A fusion block combines the audio and video latent representations, which are then used for classification and regression in the downstream tasks.

### Experimental Results

The proposed multimodal representation learning model outperforms previous work on the multi-class classification task and can also predict the schizophrenia severity score. This indicates that a properly trained multimodal representation learning model can generate better representations than task-specific models, and that the multi-task learning paradigm helps improve performance on both the classification and regression tasks.

### Conclusions and Future Work

This study shows that a multimodal representation learning framework based on vocal tract variables and facial action units has clear advantages for schizophrenia assessment. Future research will extend to more data sources to improve the model's generalization, and will consider adding a text modality to make the multimodal information more complete.
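To make the "discrete latent space" from the Method Overview more concrete, the following is a minimal sketch of the nearest-codebook lookup at the core of vector quantization. It is an illustration only, not the paper's implementation: the function names `quantize` and `quantize_sequence`, the toy two-dimensional codebook, and the example frames are all assumptions, and a real VQ-VAE also learns the codebook together with encoder and decoder networks.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`.

    Closeness is measured with squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def quantize_sequence(frames, codebook):
    """Map each continuous frame to the index of its nearest code."""
    return [quantize(frame, codebook)[0] for frame in frames]


if __name__ == "__main__":
    # Hypothetical codebook and encoder outputs, chosen only for illustration.
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
    encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
    print(quantize_sequence(encoder_outputs, codebook))
```

Each continuous frame is replaced by the index of its nearest code, so a sequence of feature frames becomes a sequence of discrete tokens.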
### Formula Example

In multi-task learning, the total loss \( L_{\text{total}} \) is computed as follows:

\[ L_{\text{total}} = \frac{L_{\text{Classification}}}{2\sigma_1^2} + \log(\sigma_1) + \frac{L_{\text{Regression}}}{2\sigma_2^2} + \log(\sigma_2) \]

where \( L_{\text{Classification}} \) and \( L_{\text{Regression}} \) are the loss functions of the classification task and the regression task respectively, and \( \sigma_1 \) and \( \sigma_2 \) are two learnable parameters representing the uncertainty of each task.
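The formula can be checked numerically with a short sketch. This is a plain-Python illustration under stated assumptions, not the paper's training code: the function name `total_loss` and the example loss values are hypothetical, and in practice \( \sigma_1 \) and \( \sigma_2 \) would be parameters updated by the optimizer rather than fixed numbers.

```python
import math


def total_loss(l_classification, l_regression, sigma1, sigma2):
    """Uncertainty-weighted sum of a classification and a regression loss.

    Implements L_cls / (2 * sigma1^2) + log(sigma1)
             + L_reg / (2 * sigma2^2) + log(sigma2).
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("sigma1 and sigma2 must be positive")
    cls_term = l_classification / (2.0 * sigma1 ** 2) + math.log(sigma1)
    reg_term = l_regression / (2.0 * sigma2 ** 2) + math.log(sigma2)
    return cls_term + reg_term


if __name__ == "__main__":
    # Hypothetical per-task loss values, chosen only for illustration.
    print(total_loss(0.8, 2.0, sigma1=1.0, sigma2=1.0))
    # A larger sigma2 down-weights the regression loss but adds a log penalty.
    print(total_loss(0.8, 2.0, sigma1=1.0, sigma2=2.0))
```

With both uncertainties equal to 1 the log terms vanish and each task loss is simply halved. Increasing a task's \( \sigma \) shrinks that task's contribution, while the \( \log(\sigma) \) term keeps the uncertainty from growing without bound.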

Self-supervised Multimodal Speech Representations for the Assessment of Schizophrenia Symptoms
