Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments

Xu Zhang, Xiangcheng Zhang, Weisi Chen, Chenlong Li, Chengyuan Yu
DOI: https://doi.org/10.1038/s41598-024-60278-1
IF: 4.6
2024-04-26
Scientific Reports
Abstract: Depression, a pervasive global mental disorder, profoundly impacts daily lives. Despite numerous deep learning studies focused on depression detection through speech analysis, the shortage of annotated bulk samples hampers the development of effective models. In response to this challenge, our research introduces a transfer learning approach for detecting depression in speech, aiming to overcome constraints imposed by limited resources. In the context of feature representation, we obtain depression-related features by fine-tuning wav2vec 2.0. By integrating 1D-CNN and attention pooling structures, we generate advanced features at the segment level, thereby enhancing the model's capability to capture temporal relationships within audio frames. In the realm of prediction results, we integrate LSTM and self-attention mechanisms. This incorporation assigns greater weights to segments associated with depression, thereby augmenting the model's discernment of depression-related information. The experimental results indicate that our model has achieved impressive F1 scores, reaching 79% on the DAIC-WOZ dataset and 90.53% on the CMDC dataset. It outperforms recent baseline models in the field of speech-based depression detection. This provides a promising solution for effective depression detection in low-resource environments.
multidisciplinary sciences
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the technical challenges of detecting depression through speech analysis in low-resource environments. Specifically, the paper focuses on solving the following issues:

1. **Data Scarcity**: Although deep learning has widespread applications in the field of depression detection, the lack of annotated data limits the development of effective models. The researchers overcome this limitation through transfer learning methods.
2. **Feature Representation**: By fine-tuning the wav2vec 2.0 model, features related to depression are extracted from speech. Combining 1D-CNN and attention pooling structures generates high-level features, enhancing the model's ability to capture temporal relationships within audio frames.
3. **Prediction Results**: Integrating LSTM and self-attention mechanisms into the model assigns higher weights to segments related to depression, thereby improving the model's ability to recognize depression-related information.

Through these methods, the proposed model achieved F1 scores of 79% and 90.53% on the DAIC-WOZ and CMDC datasets, respectively, outperforming recent baseline models and providing an effective depression detection solution in low-resource environments.
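
The attention pooling step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the paper applies pooling to wav2vec 2.0 and 1D-CNN features inside a trained network, whereas the example below assumes plain Python lists and a single dot-product scoring vector. The names `attention_pool`, `frames`, and `score_weights` are hypothetical and chosen only for illustration.

```python
import math


def attention_pool(frames, score_weights):
    """Collapse frame-level vectors into one segment-level vector.

    frames: list of T feature vectors, each of length D.
    score_weights: hypothetical learned scoring vector of length D
        (an assumption; the paper does not specify the scoring function here).
    Returns (pooled_vector, attention_weights).
    """
    if not frames:
        raise ValueError("need at least one frame")

    # One scalar relevance score per frame: dot product of the scoring
    # vector with the frame's features.
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]

    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frames, one output value per feature dimension.
    dim = len(frames[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy frames with two features each (illustrative values, not real data).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero scoring vector gives every frame the same weight (plain averaging).
uniform_pooled, uniform_weights = attention_pool(frames, [0.0, 0.0])

# A scoring vector that favors the first feature shifts weight toward
# frames 0 and 2, where that feature is active.
focused_pooled, focused_weights = attention_pool(frames, [10.0, 0.0])
```

With a zero scoring vector the pooling reduces to a simple mean over frames. A non-zero vector concentrates the weights on the frames it scores highly, which is the same weighting idea the summary attributes to the self-attention stage that emphasizes depression-related segments.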