Abstract: Introduction: As a biomarker of depression, the speech signal has attracted the interest of many researchers because it is easy to collect and non-invasive. However, recognition performance is affected by variation in subjects' speech under different scenes and emotional stimuli, by the insufficient amount of depression speech data for deep learning, and by the variable length of frame-level speech features. Methods: To address these problems, this study proposes a multi-task ensemble learning method based on speaker embeddings for depression classification. First, we extract Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Predictive Coefficients (PLP), and Filter Bank (FBANK) features from the out-of-domain dataset (CN-Celeb) and train a ResNet x-vector extractor, a Time Delay Neural Network (TDNN) x-vector extractor, and an i-vector extractor. Then, we extract the corresponding fixed-length speaker embeddings from the depression speech database of the Gansu Provincial Key Laboratory of Wearable Computing. Support Vector Machine (SVM) and Random Forest (RF) classifiers are used to obtain the classification results of the speaker embeddings in nine speech tasks. To make full use of the information in speech tasks with different scenes and emotions, we aggregate the classification results of the nine tasks into new features and then obtain the final classification results using a Multilayer Perceptron (MLP). To take advantage of the complementary effects of different features, ResNet x-vectors based on different acoustic features are fused in the ensemble learning method.
Results: Experimental results demonstrate that (1) MFCC-based ResNet x-vectors perform best among the nine speaker embeddings for depression detection; (2) interview speech outperforms picture description speech, and neutral stimuli yield the best results among the three emotional valences in the depression recognition task; (3) our multi-task ensemble learning method with MFCC-based ResNet x-vectors can effectively identify depressed patients; (4) in all cases, the combination of MFCC-based ResNet x-vectors and PLP-based ResNet x-vectors in our ensemble learning method achieves the best results, outperforming other published studies that use the same depression speech database. Discussion: Our multi-task ensemble learning method with MFCC-based ResNet x-vectors can effectively fuse the depression-related information of different stimuli, which provides a new approach for depression detection. The limitation of this method is that the speaker embedding extractors were pre-trained on the out-of-domain dataset. We will consider using an augmented in-domain dataset for pre-training to further improve depression recognition performance.
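The aggregation step described in the Methods (one classifier per speech task, whose outputs become the input features of an MLP) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn and NumPy are available, replaces real speaker embeddings with synthetic random vectors, uses only an SVM as the per-task classifier, and the subject count, embedding size, hyperparameters, and cross-validation protocol are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical sizes; the abstract only states that there are nine speech tasks.
N_SUBJECTS, N_TASKS, EMB_DIM = 60, 9, 16

rng = np.random.default_rng(0)
y = np.array([0, 1] * (N_SUBJECTS // 2))  # 0 = control, 1 = depressed (synthetic labels)

# Synthetic stand-ins for fixed-length speaker embeddings, one matrix per task.
# A small class-dependent shift makes the toy problem learnable.
task_embeddings = [
    rng.normal(size=(N_SUBJECTS, EMB_DIM)) + 0.8 * y[:, None]
    for _ in range(N_TASKS)
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def task_scores(X, labels):
    """Out-of-fold SVM decision scores for one speech task."""
    clf = SVC(kernel="rbf")
    return cross_val_predict(clf, X, labels, cv=cv, method="decision_function")


# Aggregate the per-task results into one new feature vector per subject.
meta_features = np.column_stack([task_scores(X, y) for X in task_embeddings])

# The MLP makes the final decision from the nine per-task scores.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
final_pred = cross_val_predict(mlp, meta_features, y, cv=cv)

print("meta-feature matrix shape:", meta_features.shape)
print("accuracy on synthetic data:", float(np.mean(final_pred == y)))
```

Using out-of-fold scores as meta-features keeps each subject's per-task score from being produced by a model that saw that subject's label during training. The accuracy printed here reflects only the synthetic data and says nothing about the results reported in the paper.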

Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

Hybrid Network Feature Extraction for Depression Assessment from Speech

Hierarchical Attention Transfer Networks for Depression Assessment from Speech

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection.

Depression recognition using voice-based pre-training model

Attention-Based Acoustic Feature Fusion Network for Depression Detection

Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

Depression Detection in Speech Using Transformer and Parallel Convolutional Neural Networks

Multi-Head Attention-Based Long Short-Term Memory for Depression Detection From Speech

Automated depression analysis using convolutional neural networks from speech

Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN

Experience speaks on client-server myths & truths.

Deep learning for Depression Recognition from Speech

Attention guided learnable time-domain filterbanks for speech depression detection

WavDepressionNet: Automatic Depression Level Prediction Via Raw Speech Signals

When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection

[A research on depression recognition based on voice pre-training model]

Fusing features of speech for depression classification based on higher-order spectral analysis

Hierarchical transformer speech depression detection model research based on Dynamic window and Attention merge

Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection