Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-based Model

Ying Shen, Huiyu Yang, Lin Lin
DOI: https://doi.org/10.48550/arXiv.2202.08210
2022-02-15
Abstract: Depression is a global mental health problem, the worst case of which can lead to suicide. An automatic depression detection system provides great help in facilitating depression self-assessment and improving diagnostic accuracy. In this work, we propose a novel depression detection approach utilizing speech characteristics and linguistic contents from participants' interviews. In addition, we establish an Emotional Audio-Textual Depression Corpus (EATD-Corpus) which contains audios and extracted transcripts of responses from depressed and non-depressed volunteers. To the best of our knowledge, EATD-Corpus is the first and only public depression dataset that contains audio and text data in Chinese. Evaluated on two depression datasets, the proposed method achieves state-of-the-art performance. The results demonstrate the effectiveness and generalization ability of the proposed method. The source code and EATD-Corpus are available at https://github.com/speechandlanguageprocessing/ICASSP2022-Depression.
Audio and Speech Processing, Artificial Intelligence, Sound, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses automatic depression detection. The authors propose a method that identifies depression from the speech characteristics and linguistic content of participants' interview responses. They also establish the Emotional Audio-Textual Depression Corpus (EATD-Corpus), which contains audio recordings and text transcripts from depressed and non-depressed volunteers. According to the authors, it is the first publicly available depression dataset with Chinese audio and text data.

### Main contributions of the paper:

1. **Establishment of EATD-Corpus**: A publicly available Chinese depression dataset containing audio and text transcripts from 162 volunteers, each of whom answered three randomly selected emotion-related questions.
2. **A new method for depression detection**: A Gated Recurrent Unit (GRU) model extracts audio features and a Bidirectional Long Short-Term Memory (BiLSTM) model with an attention mechanism extracts text features; a multi-modal fusion network then performs depression detection.

### Method overview:

- **Feature extraction**:
  - **Text features**: ELMo projects the text transcripts into high-dimensional sentence embeddings.
  - **Audio features**: Mel spectrograms are extracted from the audio, and NetVLAD turns them into fixed-length audio embeddings.
- **Model structure**:
  - **BiLSTM model**: Extracts text features, with an attention mechanism that emphasizes the sentences contributing most to depression detection.
  - **GRU model**: Processes the audio features, summarizing the audio embeddings into an audio representation.
  - **Multi-modal fusion**: The feature vectors produced by the GRU and BiLSTM models are concatenated, weighted by a modal attention mechanism, and passed through a fully connected layer that outputs a binary classification label.
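The multi-modal fusion step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name `fuse_and_classify`, the scoring vectors `w_audio` and `w_text`, the output weights `w_out`, and all numeric values are hypothetical, and the exact form of the modal attention in the paper may differ. The sketch assumes one scalar relevance score per modality, a softmax over the two scores, and a single sigmoid output unit standing in for the fully connected layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def fuse_and_classify(audio_vec, text_vec, w_audio, w_text, w_out, bias):
    """Weight each modality by attention, concatenate, and classify.

    All weight vectors are hypothetical stand-ins for learned parameters.
    """
    # One scalar relevance score per modality (assumed attention form).
    scores = [dot(w_audio, audio_vec), dot(w_text, text_vec)]
    alpha_audio, alpha_text = softmax(scores)

    # Scale each modality by its attention weight, then concatenate.
    fused = [alpha_audio * x for x in audio_vec] + [alpha_text * x for x in text_vec]

    # Single output unit with a sigmoid gives a depression probability.
    logit = dot(w_out, fused) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, (alpha_audio, alpha_text)


# Toy vectors standing in for the GRU (audio) and BiLSTM (text) outputs.
audio_vec = [0.2, -0.1, 0.4]
text_vec = [0.5, 0.3, -0.2]
prob, weights = fuse_and_classify(
    audio_vec, text_vec,
    w_audio=[0.1] * 3, w_text=[0.2] * 3, w_out=[0.3] * 6, bias=0.0,
)
label = int(prob >= 0.5)  # binary decision at an assumed 0.5 threshold
print(weights, prob, label)
```

The two attention weights always sum to 1, so the network can shift emphasis between audio and text per sample while keeping the fused vector on a comparable scale.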
### Experimental results:

- **Performance on the DAIC-WOZ dataset**:
  - Unimodal models (audio or text features only):
    - The proposed GRU model reaches an F1 score of 0.77 on audio features, higher than the other methods compared.
    - The proposed BiLSTM model reaches an F1 score of 0.83 on text features, close to the best method.
  - Multi-modal fusion model:
    - The proposed fusion model reaches an F1 score of 0.85 on multi-modal features, outperforming the other methods compared.
- **Performance on the EATD-Corpus dataset**:
  - Unimodal models:
    - The proposed GRU model reaches an F1 score of 0.66 on audio features, better than the other methods compared.
    - The proposed BiLSTM model reaches an F1 score of 0.65 on text features, also better than the other methods compared.
  - Multi-modal fusion model:
    - The proposed fusion model reaches an F1 score of 0.71 on multi-modal features, outperforming the other methods compared.

### Conclusion:

The proposed depression detection method performs well on two different datasets, with the multi-modal fusion model giving the best F1 score on each. The release of EATD-Corpus also provides a valuable resource for depression research. As future work, the authors plan to develop an application based on this method that lets users assess their own depressive state.
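All results above are reported as F1 scores, the harmonic mean of precision and recall. The following minimal sketch shows how that metric is computed for a binary depressed/non-depressed task. The function name `f1_score` and the toy labels are illustrative assumptions; they are not data from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = depressed, 0 = non-depressed.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(round(score, 2))
```

Because F1 ignores true negatives, it is a common choice when the positive class is the minority, as is typical for depression datasets.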