Abstract: Background and Objective. Depression is a widespread global problem that imposes significant burden and disability on individuals, families, and society. Deep learning (DL) has emerged as a valuable approach for automatically detecting depression by extracting cues from audiovisual data and making a diagnosis. The PHQ-8 is a validated diagnostic tool for depressive disorders in clinical studies, and the objective of this study is to improve the accuracy of PHQ-8 score prediction. This paper also aims to demonstrate the effectiveness of expert knowledge in depression diagnosis and to discuss a novel multimodal network architecture.

Methods. This paper focuses on multimodal depression analysis and proposes a flexible parallel transformer (FPT) model, FPT-Former, that extracts information from three distinct modalities (i.e., one video and two audio descriptors). The model has three parallel paths, each taking expert-knowledge-based descriptors from one modality as input. These descriptors are encoded into 32 features by the encoder part of a transformer module, and the features are fused to regress the final PHQ-8 score. The Extended Distress Analysis Interview Corpus (E-DAIC), an expansion of WOZ-DAIC, comprises semiclinical interviews intended to assist in the diagnosis of psychological distress conditions. It includes 275 participants, and in this study it was used to evaluate the model with 10-fold cross-validation.

Results. The FPT achieved performance comparable to state-of-the-art work, with a root mean square error (RMSE) of 4.80 and a mean absolute error (MAE) of 4.58. Ablation experiments show that the model fusing all three modalities outperforms the two-modality and single-modality models. With a PHQ-8 score threshold of 10, the accuracy of depression classification is 0.79.

Conclusions. By leveraging expert-knowledge-based multimodal measures and a parallel transformer structure, the FPT model shows promising performance in depression detection. The model improved the accuracy of depression diagnosis from audio and video, and it demonstrated the effectiveness of expert knowledge in diagnosing depression. Its flexible structure, high predictive efficiency, and secure privacy protection make it an intelligent system suitable for wider adoption in mental healthcare.
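
The evaluation protocol named in the abstract (10-fold cross-validation over 275 participants, RMSE and MAE on predicted PHQ-8 scores, and binary classification at a PHQ-8 threshold of 10) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the shuffling seed, the toy scores, and the choice to treat a score greater than or equal to 10 as the depressed class are all assumptions made for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_at_threshold(y_true, y_pred, threshold=10):
    """Binarize both score lists at `threshold` and return classification accuracy.

    Assumption: a score >= threshold counts as the depressed class.
    """
    hits = sum((t >= threshold) == (p >= threshold) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds.

    Hypothetical helper: the source does not say how its folds were drawn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]


# Toy PHQ-8 scores, invented purely to exercise the functions.
true_scores = [2, 12, 7, 15]
predicted_scores = [4.0, 9.0, 8.0, 13.0]

print(rmse(true_scores, predicted_scores))                   # about 2.121
print(mae(true_scores, predicted_scores))                    # 2.0
print(accuracy_at_threshold(true_scores, predicted_scores))  # 0.75

# With 275 participants and k=10, each test fold holds 27 or 28 participants.
for train_idx, test_idx in k_fold_indices(275, k=10):
    pass  # a model would be fit on train_idx and scored on test_idx here
```

In a real run, the metrics would be computed on the held-out predictions of each fold and then aggregated. Note that RMSE can never be smaller than MAE on the same predictions, which is consistent with the reported 4.80 and 4.58.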

Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Automatic Depression Prediction Via Cross-Modal Attention-Based Multi-Modal Fusion in Social Networks

Hybrid Network Feature Extraction for Depression Assessment from Speech

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

A Depression Detection Method Based on Multi-Modal Feature Fusion Using Cross-Attention

Multimodal Measurement of Depression Using Deep Learning Models

Multi Fine-Grained Fusion Network for Depression Detection

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

TAMFN: Time-Aware Attention Multimodal Fusion Network for Depression Detection

A Multimodal Approach for Detection and Assessment of Depression Using Text, Audio and Video

Multimodal Depression Detection based on Factorized Representation

Feature-level fusion approaches based on multimodal EEG data for depression recognition

FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

Textual-dominated Multimodal Depression Detection

An adaptive multi-graph neural network with multimodal feature fusion learning for MDD detection

MS$^{2}$-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection

A depression detection model based on multimodal graph neural network