Abstract: Background and Objective. Depression is a widespread global issue that imposes a significant burden and disability on individuals, families, and society. Deep learning (DL) has emerged as a valuable approach for automatically detecting depression by extracting cues from audiovisual data and making a diagnosis. The PHQ-8 is considered a validated diagnostic tool for depressive disorders in clinical studies, and this study aims to improve the accuracy of PHQ-8 prediction. It also aims to demonstrate the effectiveness of expert knowledge in depression diagnosis and to discuss a novel multimodal network architecture.
Methods. This paper addresses multimodal depression analysis and proposes a flexible parallel transformer (FPT) model that draws on three distinct modalities (i.e., one video and two audio descriptors). The FPT-Former model has three parallel paths, each taking expert-knowledge-based descriptors from one modality as input. The encoder part of a transformer module encodes these descriptors into 32 features, and the features are fused to perform the final regression of the PHQ-8 score. The Extended Distress Analysis Interview Corpus (E-DAIC), an expansion of WOZ-DAIC, comprises semiclinical interviews intended to assist in the diagnosis of psychological distress conditions. It includes 275 participants and was used in this study to evaluate the model with 10-fold cross-validation.
Results. The FPT achieved performance comparable to state-of-the-art work, with a root mean square error (RMSE) of 4.80 and a mean absolute error (MAE) of 4.58. The ablation experiments show that the model fusing all three modalities outperforms the two-modality and single-modality models. With a PHQ-8 score threshold of 10, the accuracy of depression classification is 0.79.
Conclusions. By leveraging expert-knowledge-based multimodal measures and a parallel transformer structure, the FPT model shows promising performance in depression detection. The model improved the accuracy of depression diagnosis from audio and video, and it demonstrated the effectiveness of expert knowledge in the diagnosis of depression. Its flexible structure, high predictive efficiency, and secure privacy protection make the model well suited for wider adoption as an intelligent system in mental healthcare.
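The evaluation measures named in the abstract (RMSE, MAE, and classification accuracy at a PHQ-8 threshold of 10) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names and the sample scores are hypothetical, and treating a score of 10 or above as "depressed" is an assumption based on the common PHQ-8 cutoff convention, since the abstract gives only the threshold value.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted PHQ-8 scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted PHQ-8 scores."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def threshold_accuracy(y_true, y_pred, threshold=10):
    """Binary accuracy after thresholding both score lists.

    Assumption: score >= threshold counts as the positive (depressed) class.
    """
    hits = sum((t >= threshold) == (p >= threshold)
               for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical scores for illustration only; they are not E-DAIC data.
true_scores = [2, 12, 7, 15, 10, 0]
pred_scores = [4.0, 9.5, 6.0, 13.0, 11.0, 3.0]

print("RMSE:", rmse(true_scores, pred_scores))
print("MAE:", mae(true_scores, pred_scores))
print("Accuracy at threshold 10:", threshold_accuracy(true_scores, pred_scores))
```

For any set of predictions the MAE cannot exceed the RMSE, which is consistent with the reported values of 4.58 and 4.80.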

Depressformer: Leveraging Video Swin Transformer and fine-grained local features for depression scale estimation

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

LSCAformer: Long and short-term cross-attention-aware transformer for depression recognition from video sequences

Hybrid Network Feature Extraction for Depression Assessment from Speech

Dynamic Facial Features in Positive-Emotional Speech for Identification of Depressive Tendencies

DepNet: An automated industrial intelligent system using deep learning for video‐based depression analysis

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation

Depressioner: Facial dynamic representation for automatic depression level prediction

Interpreting Depression From Question-Wise Long-Term Video Recording of SDS Evaluation

Dual‐task enhanced global–local temporal–spatial network for depression recognition from facial videos

Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals

Spectral Representation of Behaviour Primitives for Depression Analysis

DepMSTAT: Multimodal Spatio-Temporal Attentional Transformer for Depression Detection

Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction

FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

Multimodal Measurement of Depression Using Deep Learning Models

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

WavDepressionNet: Automatic Depression Level Prediction Via Raw Speech Signals

TCEDN: A Lightweight Time-Context Enhanced Depression Detection Network