Abstract: Depression is one of the most widespread psychological disorders and has attracted increasing attention. Achieving effective automatic multimodal depression detection to assist doctors in the early diagnosis of depression has become an important and challenging problem. To address it, this work proposes Transformer-based feature enhancement networks for multimodal depression detection. The proposed method integrates three modalities: video, audio, and remote photoplethysmographic (rPPG) signals, with rPPG introduced as an additional modality to enhance the effectiveness of multimodal depression detection. The method consists of three key steps: multimodal feature extraction for the video, audio, and rPPG modalities; Transformer-based multimodal feature enhancement (TMFE); and multimodal fusion and depression prediction based on graph fusion networks (GFN). More specifically, in the feature extraction stage, deep convolutional neural networks (CNNs) extract high-level features from the video and audio modalities, and a short-time end-to-end rPPG estimation framework extracts the rPPG signal values. The TMFE module stacks multiple Transformers, including inter-modal, intra-modal, and tri-modal Transformers, to jointly capture the dynamics and relationships within and between modalities at each time step of the input sequences. The GFN module fuses the resulting feature representations from the different modalities while preserving the interactions between them. Finally, the shared feature representation of all modalities is fed into a multilayer perceptron (MLP) network to perform the final depression detection task. Extensive experiments on two public datasets, AVEC2013 and AVEC2014, demonstrate the validity of the proposed method for depression detection.
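
To make the inter-modal step of the TMFE module more concrete, the following is a minimal sketch of scaled dot-product cross-modal attention, in which each time step of one modality attends over the time steps of another. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `cross_modal_attention`, the toy feature values, and the omission of learned query/key/value projections, multiple heads, and residual connections are all simplifications introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: time steps of the target modality (list of feature vectors).
    keys, values: time steps of the source modality.
    Queries and keys must share the same feature dimension.
    Hypothetical helper: no learned projections are applied.
    """
    scale = math.sqrt(len(keys[0]))
    dim_v = len(values[0])
    output = []
    for q in queries:
        # Similarity of this target time step to every source time step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the source values.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)
        ]
        output.append(attended)
    return output


# Toy features (assumed values): 3 video time steps, 2 audio time steps.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0]]

# Video features enhanced with information from the audio sequence.
enhanced_video = cross_modal_attention(video, audio, audio)
```

The output keeps the sequence length of the query modality, so the two sequences need not be aligned in time. Passing the same sequence as queries, keys, and values gives the intra-modal (self-attention) case.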

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

Hybrid Network Feature Extraction for Depression Assessment from Speech

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

Automatic Depression Prediction Via Cross-Modal Attention-Based Multi-Modal Fusion in Social Networks

Hierarchical Attention Transfer Networks for Depression Assessment from Speech

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

DepMSTAT: Multimodal Spatio-Temporal Attentional Transformer for Depression Detection

A time-frequency channel attention and vectorization network for automatic depression level prediction

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Multi-Head Attention-Based Long Short-Term Memory for Depression Detection From Speech

Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals

Multi-modal Depression Estimation based on Sub-attentional Fusion

Attention-Based Acoustic Feature Fusion Network for Depression Detection

Unaligned Multimodal Sequences for Depression Assessment From Speech

WavDepressionNet: Automatic Depression Level Prediction Via Raw Speech Signals

Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction

Fusing Multi-Level Features from Audio and Contextual Sentence Embedding from Text for Interview-Based Depression Detection