Abstract:Background and Objective. Currently, depression is a widespread global issue that imposes a significant burden and disability on individuals, families, and society. Deep learning (DL) has emerged as a valuable approach for automatically detecting depression by extracting cues from audiovisual data and making a diagnosis. PHQ-8 is considered a validated diagnostic tool for depressive disorders in clinical studies, and the objective of this experiment is to improve the accuracy of PHQ-8 prediction. Furthermore, this paper aims to demonstrate the effectiveness of expert knowledge in depression diagnosis and discuss a novel multimodal network architecture. Methods. This research paper focuses on multimodal depression analysis, proposing a flexible parallel transformer (FPT) model capable of extracting data from three distinct modalities (i.e., one video and two audio descriptors). The FPT-Former model incorporates three paths, each using expert-knowledge-based descriptors from one modality as inputs. These descriptors are represented into 32 features by the encoder part of a transformer module, and these features are fused to realize the final regression of PHQ-8 score. The extended distress analysis interview corpus (E-DAIC) is an expansion of WOZ-DAIC which comprises semiclinical interviews intended to assist in the diagnosis of psychological distress conditions. It encompasses a sample size of 275 participants, and in this study, it was utilized to test the model in a way of 10-fold cross-validation. Results. The FPT presented herein achieved comparable performance to the state-of-the-art works, with a root mean square error (RMSE) of 4.80 and a mean absolute error (MAE) of 4.58. The ablation experiments demonstrate that the three-modality-fused model outperforms other two-modality-fused and single-modality models. While using a PHQ-8 score threshold of 10, the accuracy of the depression classification is 0.79. Conclusions. Leveraging the strength of expert-knowledge-based multimodal measures and parallel transformer structure, the FPT model exhibits promising performance in depression detection. This model improved the accuracy of depression diagnosis through audio and video, and it also proved the effectiveness of using expert-knowledge in the diagnosis of depression. The traits of flexible structure, high predictive efficiency, and secure privacy protection make our model a promotable intelligent system in mental healthcare.

Disentangled-Multimodal Privileged Knowledge Distillation for Depression Recognition with Incomplete Multimodal Data

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection.

Hybrid Network Feature Extraction for Depression Assessment from Speech

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition

Multimodal Measurement of Depression Using Deep Learning Models

Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

Hypergraph Neural Network for Multimodal Depression Recognition

[Clinical aspects and prognosis of Legionella infections].

Decoupled Multimodal Distilling for Emotion Recognition

Dual‐task enhanced global–local temporal–spatial network for depression recognition from facial videos

Depression Diagnosis and Analysis via Multimodal Multi-order Factor Fusion

Multi-feature deep supervised voiceprint adversarial network for depression recognition from speech

Multimodal Depression Detection based on Factorized Representation

A depression detection model based on multimodal graph neural network

FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

Deep Neural Networks for Depression Recognition Based on 2D and 3D Facial Expressions Under Emotional Stimulus Tasks

Design of polydiacetylene-phospholipid supramolecules for enhanced stability and sensitivity.

Textual-dominated Multimodal Depression Detection