Automatic Depression Detection of Mobile-Based Text-dependent Speech Signals Using a Deep CNN Approach: A Prospective Cohort Study (Preprint)

Ahyoung Kim, Eun Hye Jang, Seung-Hwan Lee, Kwang-Yeon Choi, Jeon Gyu Park, Hyun-Chool Shin
DOI: https://doi.org/10.2196/preprints.34474
2021-10-25
Abstract:
BACKGROUND In the future, automatic diagnosis of depression based on speech could complement mental health treatment methods. Previous studies have reported that acoustic properties can be used to recognize depression, including mel-frequency cepstral coefficients (MFCCs), which are widely used in speech recognition. However, few studies have shown that these characteristics allow differential diagnosis of patients with depressive disorder.
OBJECTIVE This paper proposes a framework to support automatic depression detection in a mobile environment, where speech data can be easily obtained. Specifically, we recorded speech data from a predefined text-based reading task performed on a mobile device, investigated whether the recorded data can be used to screen for depression, and proposed a deep learning-based framework for automatic depression detection.
METHODS We recruited 125 patients who met the criteria for major depressive disorder (MDD) and 113 healthy controls without current or past mental illness. Participants' voices were recorded on a smartphone while they read predefined text-based sentences. We investigated the differences in voice characteristics between the MDD and healthy control groups using statistical analysis. We also investigated the feasibility of automatic depression detection using the proposed log mel (LM) spectrogram-based deep convolutional neural network (CNN) architectures and machine learning models.
RESULTS We found statistically discernible differences between the MDD and control groups in the MFCC features extracted from utterances of the predefined text-based sentences. Moreover, the best accuracy achieved with the LM spectrogram-based CNN and a softmax classifier on the speech data was 80.00%. Our results show that the deep-learned acoustic characteristics lead to better classifier performance than the conventional approach.
CONCLUSIONS This study suggests that the analysis of speech data recorded while reading text-dependent sentences could help predict depression status automatically by capturing characteristics of depression. Our method can contribute to an approach that allows individuals to assess their depressive state easily and automatically, anytime and anywhere, without the need for experts to conduct psychological assessments on-site.
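
The log mel (LM) spectrogram that the METHODS section names as the CNN input can be illustrated with a minimal sketch. The abstract does not give the authors' frame length, hop size, number of mel bands, or sampling rate, so every parameter value below is an illustrative assumption, and the function names are hypothetical. The sketch uses only the Python standard library and a naive DFT to stay self-contained; a real pipeline would more likely rely on an optimized signal-processing library.

```python
import cmath
import math


def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption; the abstract names none).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale (n_mels x n_bins)."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum of one frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8, eps=1e-10):
    """Return a list of frames, each a list of n_mels log mel energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # eps keeps the logarithm finite for bands with no energy.
        frames.append([math.log(e + eps) for e in energies])
    return frames


# Usage: a synthetic 1 kHz tone sampled at 8 kHz stands in for recorded speech.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(256)]
spec = log_mel_spectrogram(tone, sample_rate)
# Index of the strongest mel band in each frame.
loudest = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(len(spec), len(spec[0]), loudest)
```

With these settings the 256-sample tone yields 7 frames of 8 mel bands each, and the strongest band in every frame is the one whose triangular filter covers 1 kHz. In a framework like the one described, a two-dimensional array of this kind, computed from each recording, would be the image-like input to the CNN.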