Abstract: Depression is a severe psychological condition that affects millions of people worldwide. As depression has received more attention in recent years, developing automatic methods for detecting it has become imperative. Although numerous machine learning methods have been proposed for estimating depression levels via audio, visual, and audiovisual emotion sensing, several challenges remain. First, it is difficult to extract long-term temporal context from long sequences of audio and visual data. Second, it is difficult to select and fuse useful multi-modal features effectively. Third, it remains unclear how to incorporate other information or tasks to improve estimation accuracy. In this study, we propose a multi-modal adaptive fusion transformer network for estimating depression levels. Transformer-based models have achieved state-of-the-art performance in language understanding and sequence modeling, so we use a transformer-based network to extract long-term temporal context from uni-modal audio and visual data. This is the first transformer-based approach for depression detection. We also propose an adaptive fusion method for fusing useful multi-modal features. Furthermore, inspired by recent multi-task learning work, we incorporate an auxiliary task (depression classification) to enhance the main task of depression level regression (estimation). The proposed method has been validated on a public dataset (AVEC 2019 Detecting Depression with AI Sub-challenge) in terms of PHQ-8 scores. Experimental results indicate that the proposed method outperforms current state-of-the-art methods: it achieves a concordance correlation coefficient (CCC) of 0.733 on AVEC 2019, which is 6.2% higher than the CCC of 0.696 achieved by the state-of-the-art method.
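The abstract reports results as a concordance correlation coefficient (CCC). As a minimal sketch of how that metric is commonly computed (Lin's formula with population variances), the snippet below may help; the function name and the toy PHQ-8 values are illustrative assumptions and are not taken from the paper or the AVEC 2019 evaluation scripts.

```python
import numpy as np


def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's CCC between two 1-D score sequences (hypothetical helper).

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (ddof=0) variance and covariance.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mean_true = y_true.mean()
    mean_pred = y_pred.mean()

    # Population variances and covariance.
    var_true = y_true.var()
    var_pred = y_pred.var()
    cov = np.mean((y_true - mean_true) * (y_pred - mean_pred))

    return 2.0 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)


if __name__ == "__main__":
    # Toy PHQ-8-style scores (made up for illustration only).
    gold = [1, 2, 3]
    shifted = [2, 3, 4]
    print(concordance_correlation_coefficient(gold, shifted))
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels, which is why the shifted predictions in the toy example score below 1 even though they are perfectly correlated with the labels.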

Multi Task Sequence Learning for Depression Scale Prediction from Video

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance

Hybrid Network Feature Extraction for Depression Assessment from Speech

Dynamic Facial Features in Positive-Emotional Speech for Identification of Depressive Tendencies

Unaligned Multimodal Sequences for Depression Assessment From Speech

Multimodal Prediction of Affective Dimensions and Depression in Human-Computer Interactions

Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN

Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction

Multi-level Attention network using text, audio and video for Depression Prediction

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Design of polydiacetylene-phospholipid supramolecules for enhanced stability and sensitivity

Multi-Head Attention-Based Long Short-Term Memory for Depression Detection From Speech

Predicting Depression Severity by Multi-Modal Feature Engineering and Fusion

Depression Scale Recognition from Audio, Visual and Text Analysis

Multimodal Measurement of Depression Using Deep Learning Models

A Multi-Modal Hierarchical Recurrent Neural Network for Depression Detection