Depression recognition using voice-based pre-training model

Xiangsheng Huang,Fang Wang,Yuan Gao,Yilong Liao,Wenjing Zhang,Li Zhang,Zhenrong Xu
DOI: https://doi.org/10.1038/s41598-024-63556-0
IF: 4.6
2024-06-05
Scientific Reports
Abstract: The early screening of depression is highly beneficial for patients to obtain better diagnosis and treatment. While the effectiveness of utilizing voice data for depression detection has been demonstrated, the issue of insufficient dataset size remains unresolved. Therefore, we propose an artificial intelligence method to effectively identify depression. The wav2vec 2.0 voice-based pre-training model was used as a feature extractor to automatically extract high-quality voice features from raw audio. Additionally, a small fine-tuning network was used as a classification model to output depression classification results. Subsequently, the proposed model was fine-tuned on the DAIC-WOZ dataset and achieved excellent classification results. Notably, the model demonstrated outstanding performance in binary classification, attaining an accuracy of 0.9649 and an RMSE of 0.1875 on the test set. Similarly, impressive results were obtained in multi-class classification, with an accuracy of 0.9481 and an RMSE of 0.3810. The wav2vec 2.0 model was used for depression recognition for the first time and showed strong generalization ability. The method is simple and practical, and it can assist doctors in the early screening of depression.
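The abstract describes a two-stage design: a pre-trained encoder produces frame-level voice features, and a small network maps them to a class. The following is a minimal sketch of the second stage only, under assumptions that are not taken from the paper: frame features are averaged over time (mean pooling) and passed through a single linear layer with softmax. The function names, the 4-dimensional toy features, and the weights are all hypothetical; in practice the frames would come from a pre-trained wav2vec 2.0 encoder (for example `Wav2Vec2Model` in the Hugging Face `transformers` library), and the paper's actual fine-tuning network may differ.

```python
import math

def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of coefficients per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(frames, weights, bias):
    """Pool the frames, apply the linear head, and return (class index, probabilities)."""
    probs = softmax(linear(mean_pool(frames), weights, bias))
    return probs.index(max(probs)), probs

# Toy stand-in for encoder output: 3 frames of 4-dimensional features (assumed values).
frames = [
    [0.2, -0.1, 0.5, 0.0],
    [0.4, 0.1, 0.3, -0.2],
    [0.0, 0.3, 0.1, 0.2],
]
# Hypothetical head parameters for a binary task (class 0 vs. class 1).
weights = [[1.0, 0.0, -1.0, 0.5], [-1.0, 0.5, 1.0, 0.0]]
bias = [0.0, 0.1]

label, probs = classify(frames, weights, bias)
print(label, probs)
```

Extending the sketch to the multi-class setting only requires more rows in `weights` and more entries in `bias`; the pooling step is unchanged.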
multidisciplinary sciences
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the issue of early screening for depression. Specifically, the authors propose a pre-trained model based on speech signals (wav2vec 2.0) to automatically extract high-quality speech features and use a small fine-tuning network for depression classification. This method addresses the problems of insufficient dataset size, reliance on expert knowledge for handcrafted features, and the time-consuming nature of feature engineering in existing methods.

### Main Contributions

1. **Overview of Depression Recognition Research**: Discusses current research progress, limitations, and points out future research directions.
2. **Proposed AI Method**: Effectively identifies depression through speech signals, improving diagnostic accuracy and treatment efficiency.
3. **Utilization of wav2vec 2.0 Pre-trained Model**: Automatically extracts high-quality speech features, simplifying the feature extraction process and reducing reliance on handcrafted features.
4. **Using Only Speech Data**: Demonstrates that using only speech data can achieve excellent performance, avoiding the privacy risks associated with complex multimodal data.

### Experimental Results

Experiments conducted on the DAIC-WOZ dataset show that the accuracy for the binary classification task reaches 0.9649, with an RMSE of 0.1875; for the multi-class classification task, the accuracy is 0.9481, with an RMSE of 0.3810. These results indicate that the method has high performance in depression recognition.

### Conclusion

This paper proposes a simple and practical method that can assist doctors in the early screening of depression, achieving significant progress, especially in the processing of speech signals.
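To make the reported metrics concrete, the sketch below computes accuracy and RMSE from true and predicted class labels. It assumes RMSE is taken over hard integer labels, which the summary does not state. Under that assumption, binary 0/1 labels give RMSE = sqrt(1 - accuracy), and the reported binary figures are consistent with this: sqrt(1 - 0.9649) ≈ 0.1873, close to the reported 0.1875. The label lists are invented for illustration and are not data from the paper.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between integer class labels."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))

# Illustrative labels only (1 = depressed, 0 = not depressed); one of eight is wrong.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1]

acc = accuracy(y_true, y_pred)
err = rmse(y_true, y_pred)

# For 0/1 labels every error contributes exactly 1 to the squared sum,
# so the squared RMSE equals the misclassification rate.
print(acc, err, math.isclose(err ** 2, 1 - acc))
```

With more than two classes the identity no longer holds, because a prediction can miss by more than one level and each such error contributes more than 1 to the squared sum.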