Abstract: Depression is a global mental health problem that, in the worst case, can lead to suicide. An automatic depression detection system can greatly facilitate depression self-assessment and improve diagnostic accuracy. In this work, we propose a novel depression detection approach that uses speech characteristics and linguistic content from participants' interviews. In addition, we establish an Emotional Audio-Textual Depression Corpus (EATD-Corpus), which contains audio recordings and extracted transcripts of responses from depressed and non-depressed volunteers. To the best of our knowledge, EATD-Corpus is the first and only public depression dataset that contains audio and text data in Chinese. Evaluated on two depression datasets, the proposed method achieves state-of-the-art performance. These results demonstrate the effectiveness and generalization ability of the proposed method. The source code and EATD-Corpus are available at https://github.com/speechandlanguageprocessing/ICASSP2022-Depression.

What problem does this paper attempt to address?

The main problem this paper addresses is automatic depression detection. Specifically, the authors propose a new depression detection method that uses the voice characteristics and language content of participants' interviews to identify depression. They also establish an Emotional Audio-Textual Depression Corpus (EATD-Corpus) containing audio and text transcripts from depressed and non-depressed volunteers. This is the first publicly available depression dataset containing Chinese audio and text data.

### Main contributions of the paper:

1. **Establishment of EATD-Corpus**: This is a publicly available Chinese depression dataset, containing audio and text transcripts of 162 volunteers who answered three randomly selected emotion-related questions.
2. **Proposing a new method for depression detection**: This method uses a Gated Recurrent Unit (GRU) model and a Bidirectional Long Short-Term Memory (BiLSTM) model combined with an attention mechanism to extract audio and text features, and performs depression detection through a multi-modal fusion network.

### Method overview:

- **Feature extraction**:
  - **Text features**: Use ELMo to project text transcripts into high-dimensional sentence embeddings.
  - **Audio features**: Extract Mel-spectrograms from audio, and then use NetVLAD to generate fixed-length audio embeddings.
- **Model structure**:
  - **BiLSTM model**: Used to extract text features, combined with an attention mechanism to emphasize the sentences that contribute the most to depression detection.
  - **GRU model**: Used to process audio features, summarizing audio embeddings to generate audio representations.
  - **Multi-modal fusion**: Concatenate the feature vectors generated by the GRU and BiLSTM models, assign weights through a modal attention mechanism, and finally generate binary classification labels through a fully-connected layer.

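
The attention and fusion steps in the method overview can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attention_pool`, `fuse_and_classify`), the dot-product scoring against a `query` vector, the elementwise form of the modal attention weights, and the single sigmoid output unit are all assumptions made for illustration, and the toy vectors are random numbers rather than real ELMo or NetVLAD features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def attention_pool(hidden_states, query):
    """Pool per-sentence vectors into one vector.

    Assumption: each sentence is scored by a dot product with a
    (normally learned) query vector, and the scores are softmax-normalized.
    """
    weights = softmax([dot(h, query) for h in hidden_states])
    dim = len(hidden_states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, hidden_states))
              for i in range(dim)]
    return pooled, weights


def fuse_and_classify(audio_vec, text_vec, modal_weights, fc_weights, fc_bias):
    """Concatenate both modalities, reweight, and apply one linear unit.

    Assumption: modal attention is an elementwise rescaling of the
    concatenated vector; the output is a sigmoid probability.
    """
    fused = audio_vec + text_vec  # list concatenation
    attended = [w * x for w, x in zip(modal_weights, fused)]
    logit = dot(attended, fc_weights) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical stand-ins: 3 sentence vectors (BiLSTM outputs) of size 4.
    text_states = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    query = [random.gauss(0, 1) for _ in range(4)]
    text_vec, sentence_weights = attention_pool(text_states, query)

    # Hypothetical stand-in for the GRU audio representation.
    audio_vec = [random.gauss(0, 1) for _ in range(4)]
    modal_weights = softmax([random.gauss(0, 1) for _ in range(8)])
    fc_weights = [random.gauss(0, 1) for _ in range(8)]

    prob = fuse_and_classify(audio_vec, text_vec, modal_weights, fc_weights, 0.0)
    print("sentence attention weights:", sentence_weights)
    print("predicted probability of the positive class:", prob)
```

In a trained model the query vector, modal weights, and fully-connected parameters would be learned jointly; here they are random only to keep the sketch self-contained.
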
### Experimental results:

- **Performance on the DAIC-WOZ dataset**:
  - Unimodal models (using only audio or text features):
    - The proposed GRU model achieves an F1 score of 0.77 on audio features, higher than other methods.
    - The proposed BiLSTM model achieves an F1 score of 0.83 on text features, close to the best method.
  - Multi-modal fusion model:
    - The proposed fusion model achieves an F1 score of 0.85 on multi-modal features, significantly outperforming other methods.
- **Performance on the EATD-Corpus dataset**:
  - Unimodal models:
    - The proposed GRU model achieves an F1 score of 0.66 on audio features, better than other methods.
    - The proposed BiLSTM model achieves an F1 score of 0.65 on text features, also better than other methods.
  - Multi-modal fusion model:
    - The proposed fusion model achieves an F1 score of 0.71 on multi-modal features, significantly outperforming other methods.

### Conclusion:

The depression detection method proposed in this paper performs well on two different datasets, especially with multi-modal fusion. The release of EATD-Corpus also provides a valuable resource for depression research. In the future, the authors plan to develop an application, based on this method, that allows users to self-assess their depressive state.
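
For readers unfamiliar with the metric used in the experimental results, the F1 score is the harmonic mean of precision and recall. The sketch below shows the computation for a binary classifier. The labels are hypothetical toy values, not data from either corpus, and treating the depressed class as the positive class is an assumption, since the summary does not say how the reported scores were averaged.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels: 1 = depressed, 0 = non-depressed.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print("F1:", f1_score(y_true, y_pred))
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and so is the F1 score.
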

Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-based Model
