Abstract: Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with and without depression respond differently to various emotional stimuli, which provides valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus-specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. The proposed intra- and inter-emotional stimulus fusion model integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that supports accurate depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, where it outperforms state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE), and accuracy.
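
For readers unfamiliar with the two regression metrics named in the abstract, the following is a minimal sketch of how CCC and RMSE are commonly computed. It uses the standard textbook definitions with population (divide-by-n) moments; the function names and the example scores are illustrative assumptions and are not taken from the paper, whose exact evaluation code is not shown here.

```python
import math


def ccc(y_true, y_pred):
    """Concordance correlation coefficient using population (1/n) moments.

    Standard definition: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2).
    Undefined (division by zero) when both inputs are constant with equal means.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical depression-severity scores, for illustration only.
labels = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated, but offset by +1

perfect_ccc = ccc(labels, labels)
shifted_ccc = ccc(labels, shifted)
shifted_rmse = rmse(labels, shifted)
```

In this toy case the shifted predictions are perfectly correlated with the labels, yet their CCC is 4/7 rather than 1, because CCC also penalizes a systematic offset between predictions and labels. Higher CCC and lower RMSE indicate better agreement.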
