Can Deep Generative Audio be Emotional? Towards an Approach for Personalised Emotional Audio Generation

Alice Baird, Shahin Amiriparian, Björn Schuller
DOI: https://doi.org/10.1109/mmsp.2019.8901785
2019-09-01
Abstract: The ability of sound to evoke states of emotion is well known across fields of research, with clinical and holistic practitioners utilising audio to create listener experiences which target specific needs. Neural network-based generative models have in recent years shown promise for generating high-fidelity audio based on a raw audio input. With this in mind, this study utilises the WaveNet generative model to explore the ability of such networks to retain the emotionality of raw audio speech inputs. We train various models on two classes (happy and sad) of an emotional speech corpus containing 68 native Italian speakers. When classifying the combined original and generated audio, hand-crafted feature sets achieve at best 75.5% unweighted average recall, a 2 percentage point improvement over features from the original audio only. Additionally, from a two-tailed test on the predictions, we find that the audio features from the original speech concatenated with the generated audio features provide a significantly different test result compared to the baseline. Both findings indicate promise for emotion-based audio generation.
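The abstract reports results as unweighted average recall (UAR), which is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name and the toy labels are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels for illustration only (not data from the study).
y_true = ["happy"] * 4 + ["sad"] * 2
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]

# Recall is 3/4 for "happy" and 1/2 for "sad", so UAR is their plain mean.
uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

On this imbalanced toy set, plain accuracy would be 4/6, while UAR averages the two class recalls without regard to class size, which is why it is commonly preferred when emotion classes are unevenly represented.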