Abstract: Although great progress has been made in automatic depression assessment, most recent works consider only the audio and video paralinguistic information and ignore the linguistic information in the spoken content. In this work, we argue that, besides good audio and video features, reliable depression detection systems also need text-based content features to analyse depression-related textual indicators. Furthermore, improving the performance of automatic depression assessment systems requires powerful models capable of modelling the characteristics of depression embedded in the audio, visual and text descriptors. This paper proposes new text and video features and hybridizes deep and shallow models for depression estimation and classification from audio, video and text descriptors. The proposed hybrid framework consists of three main parts: 1) a Deep Convolutional Neural Network (DCNN) and Deep Neural Network (DNN) based audio-visual multi-modal depression recognition model for estimating the Patient Health Questionnaire depression scale (PHQ-8) score; 2) a Paragraph Vector (PV) and Support Vector Machine (SVM) based model for inferring the physical and mental conditions of the individual from the transcripts of the interview; 3) a Random Forest (RF) model for depression classification from the estimated PHQ-8 score and the inferred conditions of the individual. In the PV-SVM model, PV embedding is used to obtain fixed-length feature vectors from the transcripts of the answers to questions associated with psychoanalytic aspects of depression; these vectors are then fed into SVM classifiers that detect the presence or absence of the considered psychoanalytic symptoms. To the best of our knowledge, this is the first attempt to apply PV to depression analysis.
In addition, we propose a new visual descriptor, the Histogram of Displacement Range (HDR), to characterize the displacement and velocity of the facial landmarks in a video segment. Experiments on the Audio Visual Emotion Challenge (AVEC2016) depression dataset demonstrate that: 1) the proposed hybrid framework effectively improves the accuracy of both depression estimation and depression classification, reaching an average F1 measure of up to 0.746, which is higher than the best result (0.724) of the AVEC2016 depression sub-challenge; 2) HDR obtains better depression recognition performance than Bag-of-Words (BoW) and Motion History Histogram (MHH) features.
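
For readers unfamiliar with the metric, the following is a minimal sketch of how an average F1 measure for binary depression classification could be computed. It assumes the average is the unweighted mean of the F1 scores of the depressed and non-depressed classes; the abstract does not spell out the averaging scheme, and the labels below are hypothetical illustration data, not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R); returns 0.0 when precision and recall are both zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme).

    Class 1 stands for "depressed", class 0 for "not depressed".
    """
    scores = []
    for positive in (1, 0):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == positive and p == positive for t, p in pairs)
        fp = sum(t != positive and p == positive for t, p in pairs)
        fn = sum(t == positive and p != positive for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical ground-truth and predicted labels for eight participants.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(average_f1(y_true, y_pred), 3))  # prints 0.733
```

In this toy case the depressed class has an F1 of 2/3 and the non-depressed class an F1 of 0.8, so the mean is about 0.733. Averaging over both classes stops a classifier from scoring well by favouring the majority class alone.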

Predicting Depression Severity by Multi-Modal Feature Engineering and Fusion

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

Automatic Depression Prediction Via Cross-Modal Attention-Based Multi-Modal Fusion in Social Networks

Hybrid Network Feature Extraction for Depression Assessment from Speech

Dynamic Facial Features in Positive-Emotional Speech for Identification of Depressive Tendencies

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

Multimodal Measurement of Depression Using Deep Learning Models

Integrating Deep and Shallow Models for Multi-Modal Depression Analysis—Hybrid Architectures

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

Fusing Multi-Level Features from Audio and Contextual Sentence Embedding from Text for Interview-Based Depression Detection

Detect Depression from Communication: How Computer Vision, Signal Processing, and Sentiment Analysis Join Forces

Multimodal Depression Detection: Fusion of Electroencephalography and Paralinguistic Behaviors Using a Novel Strategy for Classifier Ensemble

Multimodal Prediction of Affective Dimensions Via Fusing Multiple Regression Techniques

A Multi-modal Feature Layer Fusion Model for Assessment of Depression Based on Attention Mechanisms

Multimodal Prediction of Affective Dimensions and Depression in Human-Computer Interactions

Multi-modal Depression Estimation based on Sub-attentional Fusion

Depression Detection Based on Facial Expression, Audio and Gait

Fusing features of speech for depression classification based on higher-order spectral analysis

Attention-Based Acoustic Feature Fusion Network for Depression Detection

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection