Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech

Alice Othmani, Daoud Kadoch, Kamil Bentounes, Emna Rejaibi, Romain Alfred, Abdenour Hadid
DOI: https://doi.org/10.48550/arXiv.1911.00310
2020-11-19
Abstract: Intelligent monitoring systems and affective computing applications have emerged in recent years to enhance healthcare. Examples of these applications include assessment of affective states such as Major Depressive Disorder (MDD). MDD describes the constant expression of certain emotions: negative emotions (low Valence) and lack of interest (low Arousal). High-performing intelligent systems would enhance MDD diagnosis in its early stages. In this paper, we present a new deep neural network architecture, called EmoAudioNet, for emotion and depression recognition from speech. Deep EmoAudioNet learns from the time-frequency representation of the audio signal and the visual representation of its spectrum of frequencies. Our model shows very promising results in predicting affect and depression. It performs comparably to or outperforms the state-of-the-art methods according to several evaluation metrics on the RECOLA and DAIC-WOZ datasets in predicting arousal, valence, and depression. Code of EmoAudioNet is publicly available on GitHub: https://github.com/AliceOTHMANI/EmoAudioNet
Human-Computer Interaction, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the following question: **how to accurately recognize emotions and depressive states from voice signals, and how to design a new deep neural network architecture (EmoAudioNet) for that purpose.** Specifically, it focuses on two problems:

1. **Emotion recognition**: How to predict continuous emotional dimensions (such as arousal and valence) from speech in order to better understand human emotions.
2. **Depressive state assessment**: How to use voice signals to automatically identify clinical depression (binary depression classification) and estimate its severity (depression severity level assessment), thereby supporting mental health assessment.

### Background and Motivation

- **The importance of emotions and depression**: Affective computing is of great significance in healthcare. For example, recognizing emotions and depressive states from voice can help in the early diagnosis of major depressive disorder (MDD).
- **The shortcomings of existing methods**:
  - **Traditional methods**: Approaches based on handcrafted features are effective, but they rely on complex feature engineering and have limited performance.
  - **Deep-learning methods**: Existing deep-learning models mainly focus on information in a single domain (such as the time domain or the frequency domain) and fail to fully exploit the time-frequency representation and visual spectral patterns of voice signals.

### Proposed Solution

The paper proposes a new deep neural network architecture, **EmoAudioNet**, whose main features include:

1. **Two-stream CNN structure**:
   - **MFCC-based CNN**: Extracts low-frequency features (Mel-Frequency Cepstral Coefficients, MFCC) to capture the spectral characteristics of speech.
   - **Spectrogram-based CNN**: Extracts time-frequency features (spectrogram) to capture the visual spectral patterns of voice signals.
2. **Feature fusion**: Merges the MFCC features and the spectrogram-based features to form a richer representation.
3. **Emotion and depression prediction**: The trained model predicts arousal, valence, and depressive states.

### Experimental Verification

- **Datasets**:
  - RECOLA dataset: Used for the emotion recognition experiments; contains multimodal emotional interaction data in French.
  - DAIC-WOZ dataset: Used for the depressive state assessment experiments; contains clinical interview recordings and PHQ-8 scores.
- **Experimental results**:
  - On the RECOLA dataset, EmoAudioNet reaches a Pearson correlation coefficient (PCC) of 0.9069 for arousal and 0.9221 for valence, significantly outperforming other methods.
  - On the DAIC-WOZ dataset, EmoAudioNet reaches a depression classification accuracy of 73.25%, with F1 scores of 82% (non-depressed) and 49% (depressed), and a low normalized RMSE (0.18) on the depression severity prediction task.

### Formula Examples

- **MFCC feature extraction**: MFCCs are computed after performing a short-time Fourier transform (STFT) on the voice signal. The formula is as follows:
  $$ \text{MFCC}=\text{DCT}(\log(\text{Mel-Spectrum}(X))) $$
  where $\text{Mel-Spectrum}(X)$ represents the Mel frequency spectrum, and $\text{DCT}$ represents the discrete cosine transform.
- **Spectrogram feature extraction**: The spectrogram is generated by the short-time Fourier transform (STFT): $$ S(t, f)=|\text{STFT}(x