Abstract: Intelligent monitoring systems and affective computing applications have emerged in recent years to enhance healthcare. Examples of these applications include assessment of affective states such as Major Depressive Disorder (MDD). MDD describes the constant expression of certain emotions: negative emotions (low Valence) and lack of interest (low Arousal). High-performing intelligent systems would enhance MDD diagnosis in its early stages. In this paper, we present a new deep neural network architecture, called EmoAudioNet, for emotion and depression recognition from speech. Deep EmoAudioNet learns from the time-frequency representation of the audio signal and the visual representation of its spectrum of frequencies. Our model shows very promising results in predicting affect and depression. It performs comparably to or better than state-of-the-art methods according to several evaluation metrics on the RECOLA and DAIC-WOZ datasets in predicting arousal, valence, and depression. Code of EmoAudioNet is publicly available on GitHub: https://github.com/AliceOTHMANI/EmoAudioNet

What problem does this paper attempt to address?

The paper addresses the following problem: **how to accurately recognize emotions and depressive states from voice signals, using a new deep neural network architecture (EmoAudioNet) developed for this purpose.** Specifically, it focuses on two issues:

1. **Emotion recognition problem**: How to predict continuous emotional dimensions (arousal and valence) from speech in order to better understand human emotions.
2. **Depressive state assessment problem**: How to use voice signals to automatically identify clinical depressive states (binary depression classification) and the level of depression severity (depression severity level assessment), thereby providing support for mental health assessment.

### Background and Motivation

- **The importance of emotions and depression**: Affective computing is of great significance in the healthcare field. For example, recognizing emotions and depressive states through voice can help in the early diagnosis of major depressive disorder (MDD).
- **The shortcomings of existing methods**:
  - **Traditional methods**: Approaches based on handcrafted features are effective, but rely on complex feature engineering and have limited performance.
  - **Deep-learning methods**: Existing deep-learning models mainly focus on information in a single domain (such as the time domain or the frequency domain) and fail to fully utilize the time-frequency representation and visual spectral patterns of voice signals.

### Proposed Solution

The paper proposes a new deep neural network architecture, **EmoAudioNet**, whose main features include:

1. **Two-stream CNN structure**:
   - **MFCC-based CNN**: Extracts Mel-Frequency Cepstral Coefficients (MFCC) to capture the spectral characteristics of speech.
   - **Spectrogram-based CNN**: Extracts time-frequency features (spectrogram) to capture the visual spectral patterns of voice signals.
2. **Feature fusion**: Merges the MFCC features and the spectral features to form a richer representation.
3. **Emotion and depression prediction**: The trained model predicts arousal, valence, and depressive states.

### Experimental Verification

- **Datasets**:
  - RECOLA dataset: Used for the emotion recognition experiments; contains multimodal recordings of affective interactions in French.
  - DAIC-WOZ dataset: Used for the depressive state assessment experiments; contains clinical interview recordings and PHQ-8 scores.
- **Experimental results**:
  - On the RECOLA dataset, the Pearson correlation coefficient (PCC) of EmoAudioNet reaches 0.9069 (arousal) and 0.9221 (valence), outperforming the other methods compared.
  - On the DAIC-WOZ dataset, EmoAudioNet reaches a depression classification accuracy of 73.25% and F1 scores of 82% (non-depressed) and 49% (depressed), and shows a low normalized RMSE (0.18) in the depression severity prediction task.

### Formula Examples

- **MFCC feature extraction**: MFCC is a feature calculated after performing a short-time Fourier transform (STFT) on the voice signal. The formula is as follows:
  $$ \text{MFCC}=\text{DCT}(\log(\text{Mel-Spectrum}(X))) $$
  where $\text{Mel-Spectrum}(X)$ represents the Mel frequency spectrum, and $\text{DCT}$ represents the discrete cosine transform.
- **Spectrogram feature extraction**: The spectrogram is generated by short-time Fourier transform (STFT):
  $$ S(t, f)=|\text{STFT}(x
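
The MFCC formula in the summary, DCT(log(Mel-Spectrum(X))), can be illustrated with a small standard-library sketch for a single audio frame. This is not the authors' implementation: the Hamming window, the naive DFT standing in for one STFT frame, the 20 triangular mel filters, the 13 retained coefficients, and the 8 kHz test tone are all assumptions chosen for illustration, and a real pipeline would more likely use an FFT-based audio library.

```python
import cmath
import math


def hz_to_mel(hz):
    # Common mel-scale formula (one of several conventions in use).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame, naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        # Rising slope up to the centre, falling slope after it, zero outside.
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def dct2(values):
    """Unnormalised DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc_frame(frame, sample_rate, n_filters=20, n_coeffs=13):
    """MFCC = DCT(log(Mel-Spectrum(X))) for one frame X."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    mel_energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_mel = [math.log(e + 1e-10) for e in mel_energies]  # offset avoids log(0)
    return dct2(log_mel)[:n_coeffs]


# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz (illustrative).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
coeffs = mfcc_frame(frame, sample_rate)
print(len(coeffs), [round(c, 2) for c in coeffs[:3]])
```

Stacking the coefficient vectors of successive frames gives the kind of two-dimensional MFCC map that a CNN stream can consume; the frame length and hop size used for that stacking are further choices not specified here.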

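The two regression metrics quoted in the results above, the Pearson correlation coefficient (PCC) and the normalized RMSE, can be sketched as follows. The PCC is the standard definition. How the RMSE is normalized is not stated in this summary, so dividing by the score range (24 for PHQ-8, whose scores run from 0 to 24) is an assumption, and the sample values are made up for illustration.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def normalized_rmse(pred, target, score_range):
    """RMSE divided by the score range (assumed normalization)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.sqrt(mse) / score_range


# Usage with hypothetical PHQ-8 predictions and labels.
predicted = [4.0, 10.5, 17.0, 2.0]
actual = [5.0, 12.0, 15.0, 1.0]
print(pearson_cc(predicted, actual))
print(normalized_rmse(predicted, actual, score_range=24.0))
```

The PCC is undefined when either sequence is constant (zero variance), so a production version would need to guard against division by zero.
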
Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech

Hybrid Network Feature Extraction for Depression Assessment from Speech

Automatic Assessment of Depression from Speech Via a Hierarchical Attention Transfer Network and Attention Autoencoders

Hierarchical Attention Transfer Networks for Depression Assessment from Speech

MFCC-based Recurrent Neural Network for automatic clinical depression recognition and assessment from speech

Automated depression analysis using convolutional neural networks from speech

Deep learning for Depression Recognition from Speech

Speech emotion recognition with deep convolutional neural networks

Speech depression recognition based on attentional residual network

Generalisation and Robustness Investigation for Facial and Speech Emotion Recognition Using Bio-Inspired Spiking Neural Networks

AudVowelConsNet: A phoneme-level based deep CNN architecture for clinical depression diagnosis

A novel study for depression detecting using audio signals based on graph neural network

WavDepressionNet: Automatic Depression Level Prediction Via Raw Speech Signals

Depression recognition base on acoustic speech model of Multi-task emotional stimulus

Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks

Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

Depression Scale Recognition from Audio, Visual and Text Analysis

An Ambient Intelligence-based Approach For Longitudinal Monitoring of Verbal and Vocal Depression Symptoms

[Clinical aspects and prognosis of Legionella infections].

Speech Emotion Recognition using Supervised Deep Recurrent System for Mental Health Monitoring

The Verbal and Non Verbal Signals of Depression -- Combining Acoustics, Text and Visuals for Estimating Depression Level