A Frame-based Attention Interpretation Method for Relevant Acoustic Feature Extraction in Long Speech Depression Detection

Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia
2024-06-07
Abstract: Speech-based depression detection tools could help early screening of depression. Here, we address two issues that may hinder the clinical practicality of such tools: segment-level labelling noise and a lack of model interpretability. We propose a speech-level Audio Spectrogram Transformer to avoid segment-level labelling. We observe that the proposed model significantly outperforms a segment-level model, providing evidence for the presence of segment-level labelling noise in audio modality and the advantage of longer-duration speech analysis for depression detection. We introduce a frame-based attention interpretation method to extract acoustic features from prediction-relevant waveform signals for interpretation by clinicians. Through interpretation, we observe that the proposed model identifies reduced loudness and F0 as relevant signals of depression, which aligns with the speech characteristics of depressed patients documented in clinical studies.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two issues that limit the clinical practicality of speech-based depression detection tools:

1. **Segment-level label noise**: Existing methods typically split a long recording into multiple segments and assign each segment the label of the whole recording (depressed or non-depressed). This introduces noise, because some segments from a depressed patient may carry no depression-related information yet are still labeled "depressed," which degrades the accuracy of model predictions.
2. **Lack of model interpretability**: Although deep neural networks (DNNs) perform well in depression detection, their predictions are difficult to interpret, which limits their use in clinical practice.

To address these issues, the researchers propose a speech-level Audio Spectrogram Transformer, which analyzes long speech directly and so avoids segment-level labelling, and a frame-based attention interpretation method, which extracts acoustic features from the prediction-relevant parts of the waveform so that clinicians can interpret them. The speech-level model significantly outperforms a segment-level model, which the authors take as evidence of segment-level labelling noise in the audio modality and of the advantage of analyzing longer-duration speech. Through interpretation, the model is found to treat reduced loudness and reduced fundamental frequency (F0) as relevant signals of depression, consistent with the speech characteristics of depressed patients documented in clinical studies. These findings help improve the reliability and practicality of speech-based depression screening tools in clinical settings.
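The segment-level labelling issue can be made concrete with a minimal sketch. It is not the authors' code: the function name `label_segments`, the 16 kHz sample rate, the 4-second segment length, and the label encoding (1 = depressed) are all assumptions chosen for illustration. It only shows how every segment inherits the recording-level label, regardless of what the segment contains.

```python
from typing import List, Tuple

# (start_sample, end_sample, label); a hypothetical representation of one training example
Segment = Tuple[int, int, int]


def label_segments(num_samples: int, segment_len: int, recording_label: int) -> List[Segment]:
    """Split a recording into fixed-length segments.

    Every segment inherits the recording-level label, which is the source of
    the labelling noise: a segment with no depression-related cues is still
    marked with the label of the whole recording.
    """
    segments = []
    for start in range(0, num_samples, segment_len):
        end = min(start + segment_len, num_samples)
        segments.append((start, end, recording_label))
    return segments


# Assumed setup: a 10 s recording at 16 kHz, cut into 4 s segments, labelled 1 (depressed).
segments = label_segments(num_samples=160_000, segment_len=64_000, recording_label=1)

# A speech-level model would instead receive one example covering the whole recording.
speech_level_example = (0, 160_000, 1)

for seg in segments:
    print(seg)
```

In the segment-level setup the model is trained on three examples that all carry the label 1, while the speech-level setup keeps a single example whose label describes the recording as a whole.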