Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

Zhang, Zhang, Ni, Wei, Yang, Jin, Huang, Liang, Zhang, Li, Ding, Zhang, Wang

DOI: https://doi.org/10.3390/s24123714

IF: 3.9

2024-06-08

Sensors

Abstract: Depression is a major psychological disorder with a growing impact worldwide. Traditional methods for detecting the risk of depression, predominantly reliant on psychiatric evaluations and self-assessment questionnaires, are often criticized for their inefficiency and lack of objectivity. Advancements in deep learning have paved the way for innovations in depression risk detection methods that fuse multimodal data. This paper introduces a novel framework, the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), designed to amalgamate auditory, visual, and textual cues for a comprehensive analysis of depression risk. Our approach encompasses three dedicated branches (Audio Branch, Video Branch, and Text Branch), each responsible for extracting salient features from the corresponding modality. These features are subsequently fused through a multimodal fusion (MMF) module, yielding a robust feature vector that feeds into a predictive modeling layer. To further our research, we devised an emotion elicitation paradigm based on two distinct tasks, reading and interviewing, implemented to gather a rich, sensor-based depression risk detection dataset. The sensory equipment, such as cameras, captures subtle facial expressions and vocal characteristics essential for our analysis. The research thoroughly investigates the data generated by varying emotional stimuli and evaluates the contribution of different tasks to emotion evocation. In the experiments, the AVTF-TBN model performed best when data from the two tasks were used together for detection, reaching an F1 Score of 0.78, a Precision of 0.76, and a Recall of 0.81. Our experimental results confirm the validity of the paradigm and demonstrate the efficacy of the AVTF-TBN model in detecting depression risk, showcasing the crucial role of sensor-based data in mental health detection.
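The three reported metrics can be checked against each other, since F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch of that arithmetic only; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the two-task setting.
f1 = f1_score(0.76, 0.81)
print(round(f1, 2))  # 0.78
```

The result agrees with the reported F1 Score of 0.78 to two decimal places.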

Engineering, Electrical & Electronic; Instruments & Instrumentation; Chemistry, Analytical

What problem does this paper attempt to address?

The paper addresses the inefficiency and lack of objectivity of current depression risk detection. Traditional methods rely mainly on psychiatrists' evaluations and self-assessment questionnaires, which are often criticized on both counts. The paper proposes a new framework, the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), which combines auditory, visual, and textual cues for a comprehensive analysis of depression risk. The method aims to provide a more objective assessment and to improve detection performance, thereby addressing the limitations of traditional methods. Specifically, three specialized branches each extract salient features from their own modality; a multimodal fusion (MMF) module then fuses these features into a single robust feature vector, which is passed to a prediction layer that outputs the depression risk. In addition, the researchers designed an emotion-elicitation paradigm based on reading and interview tasks to collect sensor data for depression risk detection. In the experiments, the AVTF-TBN model performed best when data from both tasks were used together, with an F1 score of 0.78, a precision of 0.76, and a recall of 0.81. The authors take this as evidence of the model's effectiveness in depression risk detection and of the value of sensor data in mental health detection.
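The data flow described above (three branches, a fusion step, a prediction layer) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: each branch is a hypothetical linear projection with `tanh`, the MMF module is replaced by plain concatenation, the prediction layer is a single logistic unit, and all weights, input sizes, and names (`EMBED_DIM`, `branch`, `fuse`, `predict`) are invented for the example.

```python
import math
import random

EMBED_DIM = 4  # hypothetical size of each branch's output vector


def random_matrix(rng, rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def branch(features, weights):
    """Stand-in for one modality branch: linear projection followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def fuse(audio_vec, video_vec, text_vec):
    """Placeholder for the MMF module: simple concatenation (an assumption)."""
    return audio_vec + video_vec + text_vec


def predict(fused, weights, bias=0.0):
    """Logistic prediction layer mapping the fused vector to a risk score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)

# Toy per-modality inputs; real inputs would be audio, video, and text features.
audio_in = [rng.random() for _ in range(6)]
video_in = [rng.random() for _ in range(8)]
text_in = [rng.random() for _ in range(5)]

audio_vec = branch(audio_in, random_matrix(rng, EMBED_DIM, len(audio_in)))
video_vec = branch(video_in, random_matrix(rng, EMBED_DIM, len(video_in)))
text_vec = branch(text_in, random_matrix(rng, EMBED_DIM, len(text_in)))

fused = fuse(audio_vec, video_vec, text_vec)
head = [rng.uniform(-1.0, 1.0) for _ in range(len(fused))]
risk = predict(fused, head)

print(len(fused), risk)
```

Concatenation is used only because it is the simplest fusion that keeps every modality's features visible to the prediction layer; the summary does not specify how the actual MMF module combines them.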
