Abstract: Depression is a prevalent mental disorder that affects a significant portion of the global population, leads to considerable disability, and contributes to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with and without depression respond differently to various emotional stimuli, which provides valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus-specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. The proposed intra- and inter-emotional stimulus fusion model integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that supports accurate prediction in depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, where it outperforms state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE), and accuracy.
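For readers unfamiliar with the two regression metrics named in the abstract, the following is a minimal sketch of how CCC and RMSE are commonly computed. It uses the standard textbook definitions with population (1/n) variances; the function names `ccc` and `rmse` and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (standard definition).

    CCC = 2 * cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))**2),
    using population (1/n) variance and covariance.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Hypothetical severity scores, for illustration only.
labels = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated but offset by one point
print(ccc(labels, labels), ccc(labels, shifted), rmse(labels, shifted))
```

Unlike Pearson correlation, CCC penalizes a systematic offset between predictions and labels: the shifted predictions above are perfectly correlated with the labels yet score below 1.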

Multimodal and Multiresolution Depression Detection from Speech and Facial Landmark Features

Hybrid Network Feature Extraction for Depression Assessment from Speech

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

Multimodal Prediction of Affective Dimensions and Depression in Human-Computer Interactions

Depression Scale Recognition from Audio, Visual and Text Analysis

Depression Severity Estimation from Multiple Modalities

Facial Geometry and Speech Analysis for Depression Detection

Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks

End-to-end multimodal system for depression detection from online recordings

Multimodal Measurement of Depression Using Deep Learning Models

The Verbal and Non Verbal Signals of Depression -- Combining Acoustics, Text and Visuals for Estimating Depression Level

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

Topic Modeling Based Multi-modal Depression Detection

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

Depression Detection and Analysis using Large Language Models on Textual and Audio-Visual Modalities

Fusing features of speech for depression classification based on higher-order spectral analysis

Unaligned Multimodal Sequences for Depression Assessment From Speech

MFCC-based Recurrent Neural Network for automatic clinical depression recognition and assessment from speech