Abstract: Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level), which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function, which does not require input-output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated into objectives at various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both the DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using the HuBERT policy for label generation achieves an F1-score of 83.15%, which is close to the state of the art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and that the proposed framework can potentially be used to determine significant depressive regions in speech utterances.
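To illustrate what "without the necessity of input-output alignment" means for a CTC objective, the following is a minimal sketch of the standard CTC forward algorithm in plain Python. It is not the Speechformer-CTC implementation: the three-symbol label set (blank, non-depressive, depressive), the function name `ctc_neg_log_likelihood`, and the toy probabilities are assumptions made for illustration only, and the paper's Expectation-One-Hot and HuBERT label generation policies are not reproduced here.

```python
import math

# Hypothetical label ids for illustration; the paper's label sets may differ.
BLANK, NON_DEPRESSIVE, DEPRESSIVE = 0, 1, 2


def ctc_neg_log_likelihood(probs, target, blank=BLANK):
    """Standard CTC loss for one sequence via the forward algorithm.

    probs:  list of T frames, each a list of per-label probabilities.
    target: list of label ids (no blanks), shorter than or equal to T.
    The loss sums over every frame-level alignment that collapses to
    `target`, so no frame-level annotation is needed.
    """
    # Extended target with blanks interleaved: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: an alignment may start with a blank or the first label.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank only between two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


# Toy example: two frames, target "non-depressive then depressive".
toy_probs = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
toy_loss = ctc_neg_log_likelihood(toy_probs, [NON_DEPRESSIVE, DEPRESSIVE])
```

In the toy example only one alignment is possible (one frame per label), so the likelihood is simply the product of the two frame probabilities. With more frames than labels, the same function sums over all ways of stretching the label sequence across the segment, which is how a CTC objective can accommodate patterns that are unevenly distributed in time.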

Cluster-to-Predict Affect Contours from Speech

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Emotional Speech Clustering Based Robust Speaker Recognition System

Affective Burst Detection from Speech using Kernel-fusion Dilated Convolutional Neural Networks

Speech Emotion Recognition Based on Clustering Assistance

Simplified Deformation Compensation for Emotional Speaker Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster

Cluster-Level Contrastive Learning for Emotion Recognition in Conversations

Speech, Head, and Eye-based Cues for Continuous Affect Prediction

Continuous Affect Prediction Using Eye Gaze and Speech

Simultaneous prediction of valence / arousal and emotion categories and its application in an HRC scenario

CAGE: Circumplex Affect Guided Expression Inference

Deep temporal clustering features for speech emotion recognition

Improving Emotion Recognition Accuracy with Personalized Clustering

Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks

Deep-seeded Clustering for Unsupervised Valence-Arousal Emotion Recognition from Physiological Signals

Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions

Speechformer-CTC: Sequential Modeling of Depression Detection with Speech Temporal Classification

Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech