Abstract: Generating music whose emotion matches that of an input video is an increasingly relevant problem. Video content creators and automatic movie directors benefit from keeping their viewers engaged, which can be facilitated by producing novel material that elicits stronger emotions in them. Moreover, there is currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually and/or hearing impaired people. Current approaches overlook the video's emotional characteristics in the music generation step, consider only static images instead of videos, are unable to generate novel music, and require a high level of human effort and skill. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its visual features and a deep Long Short-Term Memory Recurrent Neural Network to generate corresponding audio signals with similar emotional content. The former can model emotions appropriately because of its fuzzy properties, and the latter can model data with dynamic time properties well because it has access to the previous hidden state. The novelty of our proposed method lies in extracting visual emotional features and transforming them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 on the Lindsey and DEAP datasets, respectively, and similar global features in the spectrograms. This indicates that our model can appropriately perform domain transformation between visual and audio features. Based on experimental results, our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer in both datasets, and music generated by our model is also chosen more often.
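
The abstract's remark that an LSTM models time-dependent data through its previous hidden state can be illustrated with a minimal sketch. It shows a single scalar LSTM cell in plain Python, not the deep network used in the study. The function names (`lstm_step`, `run_sequence`) and the `WEIGHTS` values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    The recurrent weight multiplies h_prev, which is how information
    from earlier time steps reaches the current one.
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much of the candidate to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # proposed new cell content

    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c


def run_sequence(xs, w):
    """Feed a sequence through the cell, starting from zero states."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


# Hypothetical weights for illustration only (not taken from the paper).
WEIGHTS = {
    "forget": (0.5, 0.5, 0.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

if __name__ == "__main__":
    # The second input is zero, yet the second output is non-zero because
    # the hidden and cell states carry information from the first step.
    print(run_sequence([1.0, 0.0], WEIGHTS))
```

In this sketch, a zero input following a non-zero one still produces a non-zero output, which is the memory effect the abstract refers to.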

Can Deep Generative Audio be Emotional? Towards an Approach for Personalised Emotional Audio Generation

Temporal conditional Wasserstein GANs for audio-visual affect-related ties

A Preliminary Study on GMM Weight Transformation for Emotional Speaker Recognition

Emotional speaker recognition based on similar neighbor phenomenon

Generative Adversarial Networks in Human Emotion Synthesis: A Review

Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System

Audio-Driven Emotional 3D Talking-Head Generation

A Model of Emotional Speech Generation Based on Conditional Generative Adversarial Networks

Emotional Speech Generator by using Generative Adversarial Networks

Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks

Emotional Neural Language Generation Grounded in Situational Contexts

Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

Generation of Artificial F0-contours of Emotional Speech with Generative Adversarial Networks

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

Speech-Like Emotional Sound Generation Using WaveNet

Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

Can Generative Agents Predict Emotion?